Towards Recognizing New Semantic Concepts in New Visual Domains


Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza - University of Rome

Dottorato di Ricerca in Ingegneria Informatica – XXXII Ciclo

Candidate

Massimiliano Mancini

ID number: 1646014
Thesis Advisor: Prof. Barbara Caputo
Co-Advisor: Prof. Elisa Ricci
To Monte Santa Maria Tiberina, my home.

Abstract


Deep learning is the leading paradigm in computer vision. However, deep models rely heavily on large-scale annotated datasets for training. Unfortunately, labeling data is a costly and time-consuming process, and datasets cannot capture the infinite variability of the real world. Deep neural networks are therefore inherently limited by the restricted visual and semantic information contained in their training set. In this thesis, we argue that it is crucial to design deep neural architectures that can operate in previously unseen visual domains and recognize novel semantic concepts. In the first part of the thesis, we describe different solutions that enable deep models to generalize to new visual domains by transferring knowledge from one or more labeled source domains to a target domain where no labeled data are available. We first address the problem of unsupervised domain adaptation assuming that both source and target datasets are available, but as mixtures of multiple latent domains. In this scenario, we propose to discover the multiple domains by introducing a domain-prediction branch into the deep architecture, and to perform adaptation by considering a weighted version of batch normalization (BN). We also show how variants of this approach can be effectively applied to other scenarios, such as domain generalization and continuous domain adaptation, where we have no access to target data but can exploit either multiple sources or a stream of target images at test time. Finally, we demonstrate that deep models equipped with graph-based BN layers are effective in predictive domain adaptation, where information about the target domain is available only in the form of metadata. In the second part of the thesis, we show how to extend the knowledge of a pre-trained deep model by incorporating new semantic concepts, without having access to the original training set.
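The weighted batch normalization mentioned above can be illustrated with a short sketch. This is a minimal NumPy illustration under stated assumptions, not the thesis implementation: the function name `weighted_bn` is invented for this example, the soft domain assignments are given as input (in the proposed architecture they would come from the domain-prediction branch), and the learnable scale/bias of BN is omitted.

```python
import numpy as np

def weighted_bn(x, domain_probs, eps=1e-5):
    """Mix per-domain batch-normalization outputs by soft domain assignment.

    x            : (N, C) batch of features
    domain_probs : (N, D) soft assignment of each sample to D latent domains
    """
    # Per-domain statistics: weighted averages over the batch, where the
    # weights are the (normalized) domain-assignment probabilities.
    w = domain_probs / (domain_probs.sum(axis=0, keepdims=True) + eps)  # (N, D)
    mean = w.T @ x                                    # (D, C)
    var = w.T @ (x ** 2) - mean ** 2                  # (D, C)
    # Normalize each sample with each domain's statistics, then combine
    # the D normalized versions according to the sample's own assignment.
    x_hat = (x[:, None, :] - mean[None]) / np.sqrt(var[None] + eps)  # (N, D, C)
    return (domain_probs[:, :, None] * x_hat).sum(axis=1)            # (N, C)
```

With a single latent domain and uniform assignments this reduces to standard batch normalization, which serves as a useful sanity check.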
We first consider the problem of adding new tasks to a given network, and we show that simple task-specific binary masks modifying the pre-trained filters suffice to achieve performance comparable to that of task-specific models. We then focus on the open-world recognition scenario, where we are interested not only in learning new concepts but also in detecting unseen ones, and we demonstrate that end-to-end training and clustering are fundamental components for addressing this task. Finally, we study the problem of incremental class learning in semantic segmentation, and we discover that the performance of standard approaches is hampered by the fact that the semantics of the background change across different learning steps. We then show that a simple modification of standard entropy-based losses can largely mitigate this problem. In the final part of the thesis, we tackle a more challenging problem: given images of multiple domains and semantic categories (with their attributes), how can we build a model that recognizes images of unseen concepts in unseen domains? We propose an approach based on domain and semantic mixing of inputs and features, which is a first, promising step towards solving this problem.
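The task-specific binary masks summarized above can likewise be sketched in a few lines. This is a toy illustration, not the released implementation: `binarize` and `task_specific_kernel` are names invented here, the hard threshold stands in for the straight-through estimator that training would require, and the scalars `k0`–`k3` follow the affine mask parameterization described in Chapter 3.

```python
import numpy as np

def binarize(real_mask, threshold=0.0):
    # Hard threshold on a learned real-valued mask. During training, a
    # straight-through estimator would let gradients reach `real_mask`.
    return (real_mask >= threshold).astype(real_mask.dtype)

def task_specific_kernel(w0, real_mask, k0=1.0, k1=0.0, k2=0.0, k3=0.0):
    """Build a task-specific kernel from frozen pre-trained weights w0.

    The binary mask m selects which pre-trained filter entries to reuse;
    the scalars k0..k3 rescale the pre-trained kernel, the masked kernel
    and the mask itself, so each new task stores only m and four scalars.
    """
    m = binarize(real_mask)
    return k0 * w0 + k3 * m * w0 + k2 * m + k1
```

With `k0 = 1` and the remaining scalars at zero, the pre-trained kernel is recovered unchanged, so the original task stays representable; everything task-specific fits in one bit per weight plus a handful of scalars.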

Keywords: deep learning, transfer learning, incremental learning

Acknowledgements


I would like to thank all who contributed to achieving this amazing goal. Heartfelt thanks to my advisors, Prof. Barbara Caputo, Prof. Elisa Ricci, and Samuel Rota Bulò. At the beginning of the Ph.D. I was a tenacious but pretty messy and badly organized student. Day by day, with huge patience, countless suggestions, and precious advice, they turned that student into a researcher. I am deeply thankful to Barbara, for introducing me to research, for passing on to me her passion and dedication, and for teaching me that a clear plan is better than a bunch of ideas. A huge thanks to Elisa for inspiring me with her example, making me understand the importance of stubbornness, and showing me how to face deadlines and pressure while always keeping the same positive attitude. A special thanks to Samuel: discussing problems and ideas with you, and trying to follow your thoughts, has been an amazing advanced school for growing my scientific perspective. All of you taught me how to identify interesting research questions, and how to think in order to answer them. You showed me how to make the best out of every experience, how to celebrate successes, and how to embrace and react to failures. I enjoyed every moment of this Ph.D. and I will always be grateful to you for shaping me into the researcher I am today.

I would like to express my gratitude to Prof. Bernt Schiele and Prof. Timothy Hospedales for taking the time to read this thesis so carefully. It was a great honor for me to receive their positive and valuable feedback.

I am grateful to Stefano Messelodi for welcoming me to Fondazione Bruno Kessler and allowing me to work in such an engaging environment. Appreciation is also due to Hakan Karaoguz and Prof. Patric Jensfelt for hosting me in the RPL lab in Stockholm and introducing me to the challenges of robot vision. Additionally, I wish to thank Prof. Zeynep Akata and all members of the EML lab in Tübingen for showing me different perspectives and, recently, for welcoming me for an exciting new experience.

This journey wouldn't have been the same without some good fellows sharing the way. I thank all members of the VANDAL lab in Rome and Turin, with a big thanks to Fabio and Dario for bearing with me in my attempt to become a better supervisor. Thanks also to Fabio (the first), Paolo, Valentina, Antonio, Silvia, and Mirco for sharing lab life and conference adventures with me. I am grateful to all members of the TeV and MHUG labs in Trento, with a special mention to Pilz, Simo, Swathi, Levi, Enrico, Aliaks, and Sub: thanks for sharing lab life, stressful and joyful times, conferences, and beers with me. Heartfelt thanks to all my co-authors, Lorenzo in particular, for his fundamental support, smart insights, and the nice moments together.

Research is a part of my life, but not all of it. I wish to thank all my long-time friends in Monte, with extra gratitude to Robi, Alex and Diego. Whenever I return to my hometown, you always make me feel as if I had never been away. I love that feeling.

I would like to thank my family, from my cousins to my grandparents, for never making me feel alone. To my parents, Rinaldo and Anna: thank you for always supporting me and for the values you taught me. I do not think I can express in words how much I owe you. Thanks to my sister, Serena, for understanding me and for always reminding me of what really matters. I am proud of you.

Finally, I want to thank Elisa, my girlfriend. These years were not easy for us: a long distance between us, occasional stress, pressure. You have always been patient, helping me, pushing me, and believing in me far more than I do myself. I love you.

Contents


1 Introduction

1.1 Overview

1.1.1 Domain shift: generalizing to new visual domains

1.1.2 Semantic shift: breaking model's semantic limits

1.1.3 Recognizing unseen categories in unseen domains

1.2 Contributions

1.3 Outline

1.4 Publications

2 Recognition across New Visual Domains

2.1 Problem statement

2.2 Related Works

2.3 Preliminaries: Domain Alignment Layers

2.4 Latent Domain Discovery

2.4.1 Problem Formulation

2.4.2 Multi-domain DA-layers

2.4.3 Domain prediction

2.4.4 Training the network

2.4.5 Experimental results

2.4.6 Conclusions

2.5 Domain Generalization

2.5.1 Problem Formulation

2.5.2 Starting point: Domain Generalization with Weighted BN

2.5.3 WBN Experiments: Domain Generalization in Semantic Place Categorization

2.5.4 From BN to Classifiers: Best Sources Forward

2.5.5 Experiments: Domain Generalization in Computer Vision

2.5.6 Conclusions

2.6 Continuous Domain Adaptation

2.6.1 The KTH Handtool Dataset

2.6.2 Problem Formulation

2.6.3 ONDA: ONline Domain Adaptation with Batch-Normalization

2.6.4 Experimental results

2.6.5 Conclusions

2.7 Predictive Domain Adaptation

2.7.1 Problem Formulation

2.7.2 AdaGraph: Graph-based Predictive DA

2.7.3 Model Refinement through Joint Prediction and Adaptation

2.7.4 Experimental results

2.7.5 Conclusions

3 Recognizing New Semantic Concepts

3.1 Problem statement

3.2 Related Works

3.3 Sequential and Memory Efficient Learning of New Datasets

3.3.1 Problem Formulation

3.3.2 Affine Weight Transformation through Binary Masks

3.3.3 Learning Binary Masks

3.3.4 Experimental results

3.3.5 Conclusions

3.4 Incremental Learning in Semantic Segmentation

3.4.1 Problem Formulation

3.4.2 Modeling the Background for Incremental Learning in Semantic Segmentation

3.4.3 Experimental results

3.4.4 Conclusions

3.5 Open World Recognition

3.5.1 Problem Formulation

3.5.2 Preliminaries

3.5.3 Deep Nearest Non-Outlier

3.5.4 Boosting Deep Open World Recognition

3.5.5 Experimental results

3.5.6 Towards Autonomous Visual Systems: Web-aided OWR

3.5.7 Conclusions

4 Towards Recognizing Unseen Categories in Unseen Domains

4.1 Problem statement

4.2 Related Works

4.3 Recognizing Unseen Categories in Unseen Domains

4.3.1 Preliminaries

4.3.2 Simulating Unseen Domains and Concepts through Mixup

4.3.3 Experimental results

4.3.4 Conclusions

5 Conclusions and Future Works

5.1 Summary of contributions

5.2 Open problems and future directions

A Recognition across New Visual Domains

A.1 Latent Domain Discovery

A.1.1 mDA layers formulas

A.1.2 Training loss progress

A.1.3 Additional Results on PACS

A.2 Predictive Domain Adaptation

A.2.1 Metadata Details

A.2.2 Additional Analysis

B Recognizing New Semantic Concepts

B.1 Incremental Learning in Semantic Segmentation

B.1.1 How should we use the background?

B.1.2 Per class results on Pascal-VOC 2012

B.1.3 Validation protocol and hyper-parameters

C Towards Recognizing Unseen Categories in Unseen Domains

C.1 Recognizing Unseen Categories in Unseen Domains

C.1.1 Hyperparameter choices

C.1.2 ZSL+DG: analysis of additional baselines

C.1.3 ZSL+DG: ablation study

C.1.4 ZSL results

Bibliography

List of Figures


1.1 Overview of our research problem. Suppose we are given an initial training set composed of images of a set of classes (e.g. elephant, horse) acquired in a given domain (e.g. real photos). Two main discrepancies can occur at test time: either images contain the same semantics but in different domains (e.g. paintings, bottom-left) or they contain images of the same domain but depicting different semantic concepts (e.g. dog and giraffe, top-right). In the first case we talk about the domain shift problem, while the second considers the semantic shift problem. The goal of this thesis is to address the two problems together (bottom-right), i.e. recognizing new semantic concepts (e.g. dog, giraffe) in new visual domains (paintings).

2.1 The idea behind the proposed framework for latent domain discovery. In this section, we introduce a novel deep architecture which, given a set of images, automatically discovers multiple latent domains and uses this information to align the distributions of the internal CNN feature representations of source and target domains for the purpose of domain adaptation. In this way, more accurate target classifiers can be learned.

2.2 Schematic representation of our method applied to the AlexNet architecture (left) and of an mDA-layer (right).

2.3 Distribution of the assignments produced by the domain prediction branch for each latent domain in all possible settings of the PACS dataset. Different colors denote different source domains.

2.4 Top-6 images associated to each latent domain for the different source/target combinations. Each row corresponds to a different latent domain.

2.5 Distribution of the assignments produced by the domain prediction branch in all possible multi-target settings of the PACS dataset. Different colors denote different source domains (red: Art, yellow: Cartoon, blue: Photo, green: Sketch).

2.6 Distribution of the assignments produced by the domain prediction branch trained with the additional constraint on the entropy loss in all possible multi-target settings of the PACS dataset. Different colors denote different source domains (red: Art, yellow: Cartoon, blue: Photo, green: Sketch).
2.7 Distribution of the assignments produced by the domain prediction branch for each latent domain in all target settings of the Digits-five dataset. Different colors denote different source domains (black: MNIST, blue: MNIST-m, green: USPS, red: SVHN, yellow: Synthetic numbers).

2.8 Office31 dataset. Performance at varying number of domain labels (%) for source samples.

2.9 The domain generalization problem. At training time (orange block) images of multiple source domains (e.g. A, B, C) are available. These images are used to train different models with parameters θi. Our approach automatically computes a model D which accurately classifies images of a novel domain (not available during training) by combining the models of the known domains.

2.10 Example of the proposed WBN framework. (a) AlexNet with BN layers after each fully connected layer. (b) The same network employing Domain Alignment layers for domain adaptation, where different BN layers are used for source and target domains. (c) Our approach for DG with WBN layers.

2.11 Distribution of the values of the weights computed with AlexNet+WBN for the scenario Lj.N as target in Table 2.9. Different colors represent different original source domains.

2.12 Distribution of accuracy gains of AlexNet+WBN* w.r.t. AlexNet+BN considering Saarbrücken as target, varying both laboratory and illumination. Colors indicate larger (blue), lower (red) and comparable (green) performances.

2.13 Intuition behind the proposed BSF framework. Different domain-specific classifiers and the classifiers fusion are learned at training time on source domains, in a single end-to-end trainable architecture. When a target image is processed, our deep model optimally combines the source models in order to compute the final prediction.

2.14 Simplified architecture of the proposed BSF framework. The input image is fed to a series of domain-specific classifiers and to the domain prediction branch. The latter produces the assignment w, which is fed to the domain prediction loss. The same w is modulated by α before being used to combine the output of each classifier. The final output of the architecture, z, is fed to the classification loss.

2.15 Rotated-MNIST dataset: analysis of the assignments computed by the domain prediction branch.

2.16 Our ONDA approach for performing kitting in arbitrary conditions. Given a training set, we can train a robot vision model offline. As the robot performs the task, we gradually adapt the visual model to the current working conditions, in an online fashion and without requiring target data during the offline training phase.

2.17 The 2-arm stationary robot platform.

2.18 The statistics of the BN layers are initialized offline, by training the network on the images of the source domain. At deployment time, the input frames are processed using the global estimate of the statistics (red lines). However, the robot collects every nt input frames to compute partial BN statistics, using these estimated values to gradually update the BN statistics for the current scenario.

2.19 Experiments on isolated shifts. The labels of the x-axes denote the conditions of the target domain, with the first line indicating the light condition, the second the camera and the third the background. We underlined the changes between the source and target domains.

2.20 Accuracy vs number of updates of ONDA for different values of (a) α and (b) nt in a sample scenario. The red line denotes the BN lower bound of the starting model, while the yellow line the DIAL upper bound.

2.21 Predictive Domain Adaptation. During training we have access to a labeled source domain (yellow block) and a set of unlabeled auxiliary domains (blue blocks), all with associated metadata. At test time, given the metadata corresponding to the unknown target domain, we predict the parameters associated to the target model. This predicted model is further refined during test, while continuously receiving data of the target domain.

2.22 AdaGraph framework (best viewed in color). Each BN layer is replaced by its GBN counterpart. The parameters used in a GBN layer are computed from the given metadata and the graph. Each domain in the graph (circles) contains its specific parameters (rectangular blocks). During the training phase (blue part), a metadata entry (i.e. mz, blue block) is mapped to its domain (z). While the statistics of GBN are determined only by those of z (θz), scale and bias are computed considering also the graph edges. During test, we receive the metadata of the target domain (mv~, red block), to which no node is linked. Thus we initialize v~ and compute its parameters and statistics exploiting the connections with the other nodes in the graph (θv~).

2.23 Portraits dataset: comparison of different models in the PDA scenario with respect to the average accuracy on a target decade, with source and target domains fixed to the same region. The models are based on ResNet-18.

3.1 Idea behind our BAT approach. A network pre-trained on a given recognition task A (i.e. ImageNet) can be extended to tackle other recognition tasks B (e.g. digits) and C (e.g. traffic sign) by simply transforming the network weights (orange cubes) through task-specific binary masks (colored grids).
3.2 Overview of the proposed BAT model (best viewed in color). Given a convolutional kernel, we exploit a real-valued mask to generate a domain-specific binary mask. An affine transformation is directly applied to the binary masks, changing their range (through a scale parameter k2) and their minimum value (through k1). A multiplicative mask is applied to the original kernels, and the pre-trained kernels themselves are scaled by the factors k3 and k0, respectively. All the different masks are summed to produce the final domain-specific kernel.

3.3 Percentage of 1s in the binary masks at different layer depths for Piggyback (left) and our full model (center), and values of the parameters k1, k2, k3 computed by our full model (right), for all datasets of the Imagenet-to-Sketch benchmark and the ResNet-50 architecture.

3.4 Percentage of 1s in the binary masks at different layer depths for Piggyback (left) and our full BAT model (center), and values of the parameters k1, k2, k3 computed by our full model (right), for all datasets of the Imagenet-to-Sketch benchmark with the DenseNet-121 architecture.

3.5 Illustration of the semantic shift of the background class in incremental learning for semantic segmentation. Yellow boxes denote the ground truth provided in the learning step, while grey boxes denote classes not labeled. As different learning steps have different label spaces, at step t old classes (e.g. person) and unseen ones (e.g. car) might be labeled as background in the current ground truth. Here we show the specific case of single class learning steps, but we address the general case where an arbitrary number of classes is added.

3.6 Overview of MiB. At learning step t an image is processed by the old (top) and current (bottom) models, mapping the image to their respective output spaces. As in standard ICL methods, we apply a cross-entropy loss to learn new classes (blue block) and a distillation loss to preserve old knowledge (yellow block). In this framework, we model the semantic changes of the background across different learning steps by (i) initializing the new classifier using the weights of the old background one (left), (ii) comparing the pixel-level background ground truth in the cross-entropy with the probability of having either the background (black) or an old class (pink and grey bars) and (iii) relating the background probability given by the old model in the distillation loss with the probability of having either the background or a novel class (green bar).

3.7 Qualitative results on the 100-50 setting of the ADE20K dataset using different incremental methods. The image demonstrates the superiority of our approach on both new (e.g. building, floor, table) and old (e.g. car, wall, person) classes. From left to right: image, FT, LwF [144], ILT [178], LwF-MC [216], MiB, and the ground-truth. Best viewed in color.
3.8 In the open-world scenario a robot must be able to correctly classify known objects (apple and mug), and detect novel semantic concepts (e.g. banana). When a novel concept is detected, it should learn the new class from an auxiliary dataset, updating its internal knowledge.

3.9 Overview of the B-DOC global-to-local clustering. The global clustering (left) pushes sample representations closer to the centroid (star) of the class they belong to. The local clustering (right), instead, forces the neighborhood of a sample in the representation space to be semantically consistent, pushing away samples of other classes.

3.10 Overview of how B-DOC learns the class-specific rejection thresholds. The small circles represent the samples in the held-out set. The dashed circles, whose radius is the maximal distance (red), represent the limits beyond which a sample is rejected as a member of that class. As can be seen, the class-specific threshold is learned to reduce the rejection errors.

3.11 Comparison of NNO [15], DeepNNO and B-DOC on the RGB-D Object dataset [127]. The numbers in parenthesis denote the average accuracy among the different incremental steps.

3.12 Comparison of NNO [15], DeepNNO and B-DOC on Core50 [151]. The numbers in parenthesis denote the average accuracy among the different incremental steps.

3.13 Comparison of NNO [15], DeepNNO and B-DOC on the CIFAR-100 dataset [123]. The numbers in parenthesis denote the average accuracy among the different steps.

3.14 CIFAR-100 results in the closed world scenario.

3.15 CIFAR-100: open world performances varying the number of known and unknown classes.

3.16 CIFAR-100 results of DeepNNO in the closed world scenario for different values of w.

3.17 CIFAR-100 results of DeepNNO in the closed world scenario for
3.17 DeepNNO在封闭世界场景下的CIFAR - 100数据集结果

different values of λ .
λ的不同值。

3.18 Overview of the open world recognition task within a robotic platform.
3.18 机器人平台内开放世界识别任务概述。

Given an image of an object, a classification algorithm assigns to it a
给定一个物体的图像,分类算法会为其赋予一个

class label. If the object is recognized as novel, the object label and
类别标签。如果该对象被识别为新类别,那么对象标签和

relative are obtained through external resource (e.g. a human and/or
亲属关系是通过外部资源(例如人类和/或)获得的

the Web). Finally, the images are used to incrementally updated the
网络)。最后,这些图像用于逐步更新

knowledge base of the robot.
机器人的知识库。

3.19 CIFAR-100: performances of Web-aided OWR in the open world
3.19 CIFAR - 100:网络辅助的开放世界持续学习(Web - aided OWR)在开放世界中的性能表现

scenario, with 50 unknown classes.
场景,包含50个未知类别。

3.20 Core50 dataset: performances of Web-aided OWR in the open world
3.20 Core50数据集:网络辅助开放世界持续学习(Web-aided OWR)在开放世界中的性能表现

scenario, with 5 unknown classes.
场景,包含5个未知类别。

3.21 Qualitative results of deployment of DeepNNO on a robotic platform.
3.21 在机器人平台上部署深度神经网络优化器(DeepNNO)的定性结果。

The robot recognizes an object as unknown (i.e. the red hammer,
机器人将某个物体识别为未知物体(即红色锤子)

bottom) and adds it to the knowledge base through the incremental
底部),并通过增量式

learning procedure (top right).
学习过程(右上角)将其添加到知识库中。

118

119

121

122

123

125

125

125

125

129

130

130

131
4.1 Our ZSL+DG problem. During training we have images of multiple categories (e.g. elephant, horse) and domains (e.g. photo, cartoon). At test time, we want to recognize unseen categories (e.g. dog, giraffe), as in ZSL, in unseen domains (e.g. paintings), as in DG, exploiting side information describing seen and unseen categories. 135

4.2 Our CuMix framework. Given an image (bottom, horse, photo), we randomly sample one image from the same (middle, photo) and one from another (top, cartoon) domain. The samples are mixed through ϕ (white blocks) both at image and feature level, with their features and labels projected into the embedding space E (by g and ω respectively) and there compared to compute the final objective. Note that ϕ varies during training (top part), changing the mixing ratios within and across domains. 141

4.3 ZSL results on CUB, SUN, AWA and FLO datasets with ResNet-101 features. 144

A.1 Digits-five: plots of the domain (orange) and classification (blue) losses during the training phase. 156

A.2 Digits-five: plots of the cross-entropy loss on source samples (orange) and entropy loss on target samples (blue) for the semantic classifier during the training phase. 156

A.3 Digits-five: plots of the entropy loss on single samples (blue) and on the average batch assignments (orange) for the domain classifier during the training phase. 157

A.4 Portraits dataset: performances of AdaGraph with respect to the number of auxiliary domains available, for different source-target pairs. The years reported in the captions indicate the starting year of the source and target decades. 161

List of Tables


2.1 Digits datasets: comparison of different models in the multi-source scenario. MNIST (M) and MNIST-m (Mm) are taken as source domains, USPS (U) as target. 30

2.2 Digits-five [286] setting: comparison of different single-source and multi-source DA models. The first row indicates the target domain, with the others used as sources. 31

2.3 PACS dataset: comparison of different methods using the ResNet architecture. The first row indicates the target domain, while all the others are considered as sources. 31

2.4 PACS dataset: comparison of different methods using the ResNet architecture in the multi-source multi-target setting. The first row indicates the two target domains. 32

2.5 Office-31 dataset: comparison of different methods using AlexNet. In the first row we indicate the source (top) and the target domains (bottom). 38

2.6 Office-31: comparison with state-of-the-art algorithms. In the first row we indicate the source (top) and the target domains (bottom). 40

2.7 Office-Caltech dataset: comparison with state-of-the-art algorithms. In the first row we indicate the source (top) and the target domains (bottom). 40

2.8 DG accuracy on COLD over different lighting conditions. The first row indicates the target sequence, with the first letters denoting the laboratory and the last the illumination condition (C=Cloudy, S=Sunny, N=Night). Vertical lines separate domains of the same laboratory. * indicates the algorithm uses domain knowledge. 47

2.9 DG accuracy on COLD over different environments/sensors. The first row indicates the target sequence, with the first letters denoting the laboratory and the last the illumination condition (C=Cloudy, S=Sunny, N=Night). Vertical lines separate domains with the same illumination condition. * indicates the algorithm uses domain knowledge. 48

2.10 VPC dataset: average accuracy per class. 49

2.11 VPC dataset: comparison with state of the art. 49

2.12 SPED dataset: comparison of different models. 50

2.13 Rotated-MNIST dataset: comparison with previous methods. 53

2.14 PACS dataset: comparison with previous methods. 55

2.15 PACS dataset: sensitivity analysis. 55

2.16 Example images from the KTH Handtool Dataset. 60

2.17 Portraits dataset: ablation study. 75

2.18 CompCars dataset [292]: comparison with state of the art. denotes Decaf features as input, denotes VGG-Full. 76

2.19 CarEvolution [218]: comparison with state of the art. 77

2.20 Portraits dataset [292]: performances of the refinement strategy in the continuous adaptation scenario. 77

3.1 Accuracy of ResNet-50 architectures in the ImageNet-to-Sketch scenario. 91

3.2 Accuracy of DenseNet-121 architectures in the ImageNet-to-Sketch scenario. 92

3.3 Accuracy of VGG-16 architectures in the ImageNet-to-Sketch scenario. 93

3.4 Results in terms of S and Sp scores for the Visual Decathlon Challenge. 94

3.5 Impact of the parameters k0, k1, k2 and k3 of our model using the ResNet-50 architecture in the ImageNet-to-Sketch scenario. denotes a learned parameter, while * denotes [160] obtained as a special case of our model. 95

3.6 Impact of the parameters k0, k1, k2 and k3 of our model using the DenseNet-121 architecture in the ImageNet-to-Sketch scenario. denotes a learned parameter, while * denotes [160] obtained as a special case of our model. 95

3.7 Mean IoU on the Pascal-VOC 2012 dataset for the disjoint incremental class learning scenarios. 107

3.8 Mean IoU on the Pascal-VOC 2012 dataset for the overlapped incremental class learning scenario. 107

3.9 Ablation study of the proposed method on the Pascal-VOC 2012 overlapped setup. CE and KD denote our cross-entropy and distillation losses, while init denotes our initialization strategy. 108

3.10 Mean IoU on the ADE20K dataset for different incremental class learning scenarios, adding 50 classes at each step. 109

3.11 Mean IoU on the ADE20K dataset for a multi-step incremental class learning scenario, adding 50 classes in 5 steps. 109

3.12 Ablation study of B-DOC on the global (GC) and local (LC) clustering and the triplet loss on the OWR metric. The right column shows the average OWR-H over all steps. 127

3.13 Rejection rates of different techniques for detecting the unknowns. The results are computed using the same feature extractor on the RGB-D Object dataset. 127

4.1 Domain generalization accuracies on PACS with ResNet-18. 146

4.2 Ablation on the PACS dataset with ResNet-18 as backbone. 147

4.3 ZSL+DG scenario on the DomainNet dataset with ResNet-50 as backbone. 148

A.1 PACS dataset: comparison of different methods using the ResNet architecture. The first row indicates the target domain, while all the others are considered as sources. The numbers in parenthesis indicate the results using a target validation set for model selection. 157

A.2 PACS dataset: comparison of different methods using the ResNet architecture in the multi-source multi-target setting. The first row indicates the two target domains. The numbers in parenthesis indicate the results using a target validation set for model selection. 158

A.3 CompCars dataset [292]: results with the ResNet-18 architecture. 160

B.1 Comparison of different implementations of LwF-MC on the Pascal-VOC 2012 overlapped setup. 163

B.2 Comparison of different implementations of LwF-MC on the 50-50 setting of the ADE20K dataset. 163

B.3 Per-class mean IoU on the 19-1 setting of Pascal-VOC 2012, disjoint setup. 164

B.4 Per-class mean IoU on the 19-1 setting of Pascal-VOC 2012, overlapped setup. 164

B.5 Per-class mean IoU on the 15-5 setting of Pascal-VOC 2012, disjoint setup. 165

B.6 Per-class mean IoU on the 15-5 setting of Pascal-VOC 2012, overlapped setup. 165

B.7 Per-class mean IoU on the 15-1 setting of Pascal-VOC 2012, disjoint setup. 165

B.8 Per-class mean IoU on the 15-1 setting of Pascal-VOC 2012, overlapped setup. 165

C.1 ZSL+DG scenario on the DomainNet dataset with ResNet-50 as backbone. 168

C.2 Results on the DomainNet dataset with Real-Painting as sources and ResNet-50 as backbone. 169

C.3 ZSL results. 170

Chapter 1


Introduction


1.1 Overview


A long-standing goal of artificial intelligence and robotics is the implementation of agents able to interact with the real world. A crucial step towards this goal lies in making agents understand the current state of the surrounding environment, by providing them with both powerful sensors and the ability to process the information those sensors produce. In this respect, visual cameras are among the most powerful and information-rich sensors. Indeed, applications requiring visual abilities are countless: from self-driving cars, to service robots detecting and handling objects in homes, from kitting in industrial workshops to robots filling shelves and shopping baskets in supermarkets. All of them imply interacting with a wide variety of objects, which requires a deep understanding of what these objects look like, their visual properties and their associated functionalities.

Due to the central role that vision plays on the path towards developing agents with intelligent, autonomous behaviors, considerable research effort has been devoted to improving computer and robot vision systems. Within this context, in recent years these fields have seen unprecedented advancements thanks to deep learning architectures [87]. Deep models are very effective in learning discriminative representations from input data, and their applications touch many different fields, such as natural language processing [180, 45, 55, 296], speech recognition [101, 53, 54] and reinforcement learning [145, 182, 94]. In computer vision, Convolutional Neural Networks (CNNs) [131] are the leading paradigm. These networks are particularly effective in processing grid-like input data [87], a category to which images belong. The successes of CNNs in computer vision are countless: they have achieved outstanding results in many visual tasks, ranging from object classification [124, 98] and detection [83, 220] to more complex ones such as image captioning [113, 295], visual question answering [8, 284] and motion transfer [242, 33].

Despite their effectiveness, CNNs have some drawbacks. First, they are data-hungry: very large labeled datasets are usually required to train them [225]. This is a major issue, since it is hard to obtain a large amount of labeled data for every possible application scenario. This often happens in robotics, for instance, where data acquisition and annotation are especially time-consuming and often infeasible.

Another major limitation of deep architectures is that their effectiveness is limited to the particular set of knowledge present in their training set, relying on the closed world assumption (CWA) [254]. This assumption rarely holds in practice: due to the large variability of the real world, training and test images may differ significantly in visual appearance, or may even contain different semantic categories. As a simple example, consider the scenario represented in Figure 1.1. If we train a system to recognize animals (e.g. elephants and horses) in a given visual domain (e.g. real photos), it will inherently assume that (i) those animals are the only animals we want to recognize and (ii) they will always appear under the distribution of real images. Unsurprisingly, such a model will struggle to distinguish the same animals in a different visual domain (e.g. paintings), and it will never be able to recognize animals (e.g. dogs and giraffes) not present in its initial training set. This is a toy example but, in reality, the applications where we would like to adapt a model to new input distributions and/or semantics are countless. For example, given a robot manipulation task, we cannot forecast a priori all the possible conditions (e.g. environments, lighting) it will be employed in. Moreover, we might have data only for a subset of the objects we would like to recognize, at least initially. Similar reasoning applies to autonomous driving, where it is nearly impossible to collect data for every possible driving condition (e.g. weather, road), and the semantic categories we want to recognize might change with the location (e.g. region-specific animals) or the purpose of the vehicle (e.g. garbage collection).



Figure 1.1. Overview of our research problem. Suppose we are given an initial training set composed of images of a set of classes (e.g. elephant, horse) acquired in a given domain (e.g. real photos). Two main discrepancies can occur at test time: either images contain the same semantics but in different domains (e.g. paintings, bottom-left), or they belong to the same domain but depict different semantic concepts (e.g. dog and giraffe, top-right). The first case is the domain shift problem, while the second is the semantic shift problem. The goal of this thesis is to address the two problems together (bottom-right), i.e. recognizing new semantic concepts (e.g. dog, giraffe) in new visual domains (paintings).


The goal of this thesis is to address these two problems together. In particular, we want to extend the effectiveness of deep architectures to visual domains and semantic concepts not included in the initial training set, with the long-term goal of building visual recognition systems capable of recognizing new semantic concepts in new visual domains.

1.1.1 Domain shift: generalizing to new visual domains


To recognize new semantic concepts in new visual domains, the first problem we must face is generalizing to new visual domains, overcoming the domain shift problem. To this end, Domain Adaptation (DA) methods [48, 270] are specifically designed to transfer knowledge from a source domain, where a large amount of labeled data is available, to a domain of interest, i.e. the target domain, where few or no labeled data are available. While standard approaches usually focus on a single-source, single-target scenario [77, 156], a large variety of settings exist, depending on the information we have about our source and target domains. For instance, we might have multiple source and/or multiple target domains, as in multi-source DA [286, 308] and multi-target DA [43, 81]. In these cases, a naive application of single-source/single-target domain adaptation algorithms does not suffice, leading to poor results. Moreover, the domains might be either explicitly separated or merged in a mixed dataset; in the latter case, we must first discover the various domains in order to effectively address the domain shift problem [85, 283, 104]. While standard DA assumes that data of the target domain are available during the initial training phase, a more realistic scenario is that, initially, we have no images of the target domain at all. This problem arises in practice whenever our systems are employed in unseen environments, such as novel viewpoints, illumination, or weather conditions. There are three possible ways to tackle this problem, depending on the information we have about our target domain.

In case we have no information about our target but we have multiple source domains, we can address this problem by disentangling domain-specific and domain-agnostic components, thereby building a model robust to any possible target domain shift. This is the goal of domain generalization (DG), which has recently raised a lot of interest in the community [133, 135, 27]. Conversely, if we have no information about our target and a single source domain, we cannot disentangle domain-specific and semantic-specific components. In this scenario, the only feasible strategy is to dynamically adapt our model as we receive target domain data at test time, in a continuous fashion. This setting is called Continuous DA, and multiple works tried to address it before the deep learning breakthrough, e.g. through manifold-based techniques [103] and low-rank exemplar SVMs [139].
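To make the continuous-adaptation idea concrete, the following is a minimal numpy sketch (illustrative names, not the implementation used in this thesis) of how a model's batch-normalization statistics can drift from the source towards the target as unlabeled target batches arrive at test time; the `momentum` parameter is a hypothetical knob.

```python
import numpy as np

def update_bn_stats(mean, var, batch, momentum=0.1):
    """Exponentially update running BN statistics with a new target batch.

    As unlabeled target samples arrive, the running mean/variance move from
    the source statistics towards the target ones, adapting the network
    without any label.
    """
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = (1 - momentum) * mean + momentum * batch_mean
    new_var = (1 - momentum) * var + momentum * batch_var
    return new_mean, new_var

def normalize(x, mean, var, eps=1e-5):
    """Standard BN normalization with the (adapted) running statistics."""
    return (x - mean) / np.sqrt(var + eps)
```

Repeatedly applying `update_bn_stats` over a stream of target batches makes the normalization statistics converge to those of the target distribution, which is the core intuition behind BN-based continuous adaptation.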

Finally, we might have information about the target domain shift in the form of metadata describing the visual inputs we should expect. This scenario is called Predictive DA (PDA): it assumes the presence of a single source domain and multiple auxiliary ones, each with its own metadata [293]. Understanding how the metadata link to the domain-specific parameters allows us to infer a model for any target domain, given its description.
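One plausible way to realize this metadata-to-parameters inference, sketched below in numpy with illustrative names (this is a simplification, not the exact method of the thesis), is to weight the parameters of the auxiliary domains by a kernel on metadata distance; `sigma` is a hypothetical bandwidth.

```python
import numpy as np

def predict_target_params(meta_target, metas, params, sigma=1.0):
    """Infer domain-specific parameters for an unseen target from metadata.

    Each auxiliary domain i has metadata `metas[i]` and parameters
    `params[i]`; edge weights come from a Gaussian kernel on squared
    metadata distance, and the target parameters are the weight-normalized
    combination of the auxiliary ones.
    """
    d = np.array([np.sum((meta_target - m) ** 2) for m in metas])
    w = np.exp(-d / (2 * sigma ** 2))
    w = w / w.sum()  # normalize weights over auxiliary domains
    return sum(wi * p for wi, p in zip(w, params))
```

A target whose metadata sit close to one auxiliary domain inherits mostly that domain's parameters; a target equidistant from two domains obtains their average.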

The first part of this thesis describes how we provide solutions for the domain-shift problem, regardless of the information available about our source/target domains. We start from the latent domain discovery problem, where we assume to have data of both source and target domains, but with the two being mixtures of multiple hidden domains. In this scenario, we show how a weighted version of batch normalization (BN) [109], coupled with a domain-discovery branch, can equip a deep architecture with the ability to discover latent domains for DA [169, 168]. We then show how the same domain classifier can be applied to the more complex DG task, where no data about the target domain are available. In particular, the similarity among domains can be exploited either within the network (i.e. through BN layers [164]) or at the classification level [163] to effectively tackle DG. Finally, we extend BN-based DA algorithms to the PDA scenario by relating domains and their specific parameters through a graph, where each node is a domain (with attached parameters) and the weight of each edge depends on the similarity between domains, as given by the available metadata [165]. Moreover, we provide a simple extension of BN to tackle the Continuous DA problem, showing the effectiveness of this algorithm both in challenging robotics scenarios [166] and as a tool to refine the target model predicted by our PDA algorithm [165].
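The weighted BN idea can be sketched as follows: a minimal numpy example (illustrative names and shapes, not our actual layer) in which each sample is normalized with a mixture of per-domain statistics, weighted by the soft domain assignment produced by a domain-prediction branch.

```python
import numpy as np

def weighted_bn(x, domain_probs, means, variances, eps=1e-5):
    """Weighted batch normalization for latent-domain discovery (sketch).

    `domain_probs[i, k]` is the probability, given by a domain-prediction
    branch, that sample i belongs to latent domain k; `means[k]` and
    `variances[k]` are the per-domain statistics over C channels. Each
    sample is normalized with its soft mixture of domain statistics.
    """
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        mean = domain_probs[i] @ means       # (K,) @ (K, C) -> (C,)
        var = domain_probs[i] @ variances
        out[i] = (x[i] - mean) / np.sqrt(var + eps)
    return out
```

When the domain assignment is hard (one-hot), this reduces to standard per-domain BN; soft assignments let the network interpolate between the discovered latent domains.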

1.1.2 Semantic shift: breaking the model's semantic limits


The second major problem we must tackle, if we want to recognize new semantic concepts in unseen domains, is how to integrate novel knowledge within our deep architecture, thereby overcoming the semantic shift problem. To this end, multiple works have tried to extend the knowledge base of a pre-trained deep model and, depending on the information we have regarding the new concepts, we can split them into three main categories.

In the case where data are available for the new concepts, we are in the incremental learning scenario [216, 118, 144]. In incremental learning (IL), we have a pre-trained model and we receive data of the new classes/tasks in successive learning stages, without access to the original training set. The goal is to sequentially learn new classes/tasks as new data become available while not forgetting previous knowledge, thereby addressing the catastrophic forgetting problem.
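A common ingredient against forgetting in this setting is knowledge distillation: the new model is encouraged to reproduce, on the old classes, the softened predictions of the previous model. Below is a minimal numpy sketch (generic distillation, not any specific method of this thesis); `T` is the usual temperature.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(old_logits, new_logits, T=2.0):
    """Cross-entropy between the old model's softened predictions and the
    new model's ones on the shared (old) classes.

    Minimizing this term keeps the new model's behavior on old classes
    close to the frozen previous model, mitigating catastrophic forgetting.
    """
    p_old = softmax(old_logits / T)
    log_p_new = np.log(softmax(new_logits / T) + 1e-12)
    return -(p_old * log_p_new).sum(axis=-1).mean()
```

In practice this term is summed with a standard cross-entropy on the new-class data, trading off plasticity (learning new classes) against stability (preserving old ones).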

A special case arises when we want our model not only to acquire new knowledge but also to detect unseen concepts. This is the goal of open-world recognition (OWR), where the task is to classify images belonging to the categories of the training set, to spot samples corresponding to unknown classes and, based on such unknown-class detections, to update the model so as to progressively include the novel categories [15].

A second scenario assumes that just one or a few samples are available for the novel semantic concepts. This is the case of one- and few-shot learning [66, 266, 246, 255], where we use the available training data to build a model capable of inferring the classifier for the novel classes given a small amount of data. Solutions to this problem usually rely on classifier regression [121], weight imprinting [211, 246] and meta-learning techniques [68, 253].
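As an illustration of one of these families, weight imprinting can be sketched in a few lines of numpy (a generic sketch of the technique, not a specific method of this thesis): the classifier weight of a new class is simply the normalized mean embedding of its few samples.

```python
import numpy as np

def imprint_weights(embeddings):
    """Weight imprinting for a novel class (sketch).

    The classifier weight of the new class is the L2-normalized mean of the
    (normalized) embeddings of its few available samples, so that a
    cosine-similarity classifier can score the class without retraining.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = e.mean(axis=0)
    return w / np.linalg.norm(w)
```

The imprinted vector is appended as a new column of the classification layer; fine-tuning can then refine it if more data become available.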

Finally, we might face the extreme case where no training data are available for the new categories we want to recognize. This research thread is Zero-Shot Learning (ZSL) [130, 1, 278], where the goal is to recognize semantic concepts not seen during training, given external information about the novel classes. This information is available either in the form of manually annotated attributes, visual descriptions or word embeddings [2, 278].
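A typical attribute-based ZSL predictor can be sketched as follows (a minimal numpy example with an assumed learned projection `W`; names are illustrative): the image feature is mapped into attribute space and compared with the attribute vectors of the unseen classes.

```python
import numpy as np

def zsl_predict(image_feat, W, class_attributes):
    """Attribute-based zero-shot prediction (sketch).

    `W` is a (hypothetical) learned mapping from image-feature space to
    attribute space; `class_attributes[c]` is the attribute vector of class
    c. The compatibility score is a dot product, and the best-scoring
    (possibly unseen) class is returned.
    """
    proj = W @ image_feat             # project the image into attribute space
    scores = class_attributes @ proj  # one compatibility score per class
    return int(np.argmax(scores))
```

Since `W` is learned only on seen classes, the hope is that the image-to-attribute mapping generalizes to the attribute vectors of the unseen ones.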
In the second part of the thesis, we explore ways to include novel semantic concepts within a pre-trained architecture. In particular, we start by considering multi-task/domain learning, where the goal is to sequentially learn multiple classifiers for different domains/tasks from a single pre-trained model. To this end, we propose an algorithm based on task-specific binary masks applied on top of the parameters of the pre-trained model. We show how, while requiring very few additional parameters, our algorithm achieves performance comparable to task-specific fine-tuned models.
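The core of the binary-mask idea can be sketched in numpy as follows (a simplified illustration: `scale` stands in for any additional per-task transform and is an assumption of this sketch, not necessarily part of the method): each task stores only a binary mask over the frozen pre-trained weights.

```python
import numpy as np

def masked_forward(x, W_pretrained, mask, scale=1.0):
    """Task-specific binary masking of frozen weights (sketch).

    Each new task stores only a binary mask (one bit per weight); its
    layer uses the elementwise product of the frozen pre-trained weights
    and the mask, so the storage overhead per task is tiny.
    """
    W_task = scale * (W_pretrained * mask)  # select a subnetwork per task
    return W_task @ x
```

During training, the masks are usually learned through a real-valued relaxation that is thresholded at inference time, while the backbone weights stay untouched.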

Furthermore, we move towards the incremental class learning scenario, considering OWR. For this, we develop the first end-to-end trainable architecture for OWR [167], based on a deep extension of non-parametric classifiers, i.e. NCM and NNO [177, 95, 15]. We also show how the performance of this algorithm can be improved by clustering strategies that push samples closer to their class-specific centroid while distancing them from those of other classes [69].
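The decision rule underlying such non-parametric open-world classifiers can be sketched as follows (a minimal numpy example; class-specific thresholds are assumed to have been learned beforehand): assign to the nearest class mean, unless the distance exceeds that class's rejection threshold.

```python
import numpy as np

def ncm_predict(feature, centroids, thresholds):
    """Nearest-class-mean prediction with open-set rejection (sketch).

    The sample is assigned to the class of the closest centroid, unless its
    distance exceeds that class's learned rejection threshold, in which
    case it is flagged as unknown (-1).
    """
    dists = np.linalg.norm(centroids - feature, axis=1)
    c = int(np.argmin(dists))
    return c if dists[c] <= thresholds[c] else -1
```

New classes are added simply by appending their centroid (and threshold), which is what makes the NCM family attractive for open-world, incremental settings.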

Finally, we explore the application of incremental class learning (ICL) techniques to semantic segmentation [31]. Here we discover that the performance of standard approaches is hampered by the semantic content of the background class, which changes across incremental steps. We call this problem background semantic shift, and we provide a first solution to it through a simple yet effective modification of the logits used within standard distillation and entropy-based losses.

1.1.3 Recognizing unseen categories in unseen domains


An open research question is whether we can address the domain and semantic shift problems together, producing a deep model able to recognize new semantic concepts in possibly unseen domains. In the third part of this thesis, we start analyzing how to merge these two worlds, providing a first attempt in this direction in an offline but quite extreme setting. In particular, we consider a scenario where, during training, we are given a set of images of multiple domains and semantic categories, and our goal is to build a model that can recognize images of unseen concepts, as in ZSL, in unseen domains, as in DG. This new problem, which we call ZSL+DG, poses novel research questions that go beyond those of the DG and ZSL problems taken in isolation. For instance, we can rely on the fact that multiple source domains permit disentangling semantic and domain-specific information, as in DG. Despite this, we have no guarantee that the disentanglement will hold for the unseen semantic categories at test time. Additionally, while in ZSL it is reasonable to assume that the learned mapping between images and semantic attributes will generalize to images of unseen concepts, in ZSL+DG we have no guarantee that this will happen for images of unseen domains.

To tackle this problem, we propose a solution based on a variant of the well-known mixup regularization strategy [301]. In particular, we show how we can use mixup to simulate features of novel domains and semantic concepts during training, achieving state-of-the-art performance in DG, ZSL, and the novel ZSL+DG scenario [162]. To the best of our knowledge, this is the first algorithm able to work in both worlds, recognizing unseen semantic concepts in unseen domains.
为了解决这个问题,我们提出了一种基于著名的混合增强(mixup)正则化策略变体的解决方案[301]。具体而言,我们展示了如何在训练过程中使用混合增强来模拟新领域和语义概念的特征,从而在领域泛化(DG)、零样本学习(ZSL)以及新颖的ZSL + DG场景中均取得了最先进的性能[162]。据我们所知,这是第一个能够在两个方面都发挥作用的算法,即能够在未见领域中识别未见语义概念。
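As a reference, vanilla mixup [301] takes only a few lines: two samples and their one-hot labels are convexly combined with a coefficient drawn from a Beta distribution. The sketch below is plain mixup on feature vectors (a starting point, not the full curriculum over domains and classes used in our method):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    # Convexly combine two inputs and their one-hot labels. Pairs drawn
    # from different domains/classes simulate novel ones during training.
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

The mixed label stays a valid probability distribution, so the usual cross-entropy can be applied to the interpolated sample unchanged.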

1.2 Contributions
1.2 贡献


Focusing on visual recognition, this thesis contributes towards developing deep learning architectures able to cope with test images containing both different visual domains (i.e. domain shift) and new semantic concepts (i.e. semantic shift) unseen during the initial training phase. To this end, we can divide the main contributions into three parts. The first contains techniques able to tackle the well-known domain shift problem of classical DA by considering non-canonical scenarios where the amount of information regarding the source or target domains varies. The second part contains algorithms able to extend pre-trained architectures with new semantic concepts (i.e. tasks or classes) using external datasets not available during the initial training phase. The goal of these algorithms is to produce models capable of recognizing previously unseen concepts without hampering the performance on old ones. In the third part, we start exploring the recognition of unseen semantic concepts in unseen visual domains, presenting one of the first works merging these two worlds. In the following, we will describe the specific contributions presented in each part.
本论文聚焦于视觉识别,致力于开发深度学习架构,使其能够处理包含不同视觉领域(即领域偏移)以及在初始训练阶段未见的新语义概念(即语义偏移)的测试图像。在此范围内,我们可以将主要贡献分为三个部分。第一部分包含能够解决经典领域自适应(DA)中著名的领域偏移问题的技术,这些技术考虑了非典型场景,即关于源领域或目标领域的信息量有所不同的情况。第二部分包含能够利用初始训练阶段不可用的外部数据集,将预训练架构扩展到新语义概念(即任务或类别)的算法。这些算法的目标是生成能够识别先前未见概念的模型,同时不影响对旧概念的识别性能。在第三部分,我们开始探索在未见视觉领域中对未见语义概念的识别,展示了将这两个领域融合的首批工作之一。接下来,我们将描述每个部分所呈现的具体贡献。

Modeling the Domain Shift. In the context of attacking the domain shift problem, we will present:
对领域偏移进行建模 在解决领域偏移问题的背景下,我们将展示:

  • The first deep learning model capable of discovering latent domains in unsupervised domain adaptation, when the source domain is composed of a mixture of multiple visual domains [169, 168, 170]. Specifically, the architecture is based on two main components, i.e. a side branch that automatically computes the assignment of each sample to its latent domain, and novel layers that exploit domain membership information to appropriately align the distribution of the CNN internal feature representations to a reference distribution.
  • 第一个能够在无监督领域自适应中发现潜在领域的深度学习模型,当源领域由多个视觉领域的混合组成时 [169,168,170]。具体而言,该架构基于两个主要组件,即一个自动计算每个样本到其潜在领域分配的侧分支,以及利用领域成员信息将卷积神经网络(CNN)内部特征表示的分布与参考分布进行适当对齐的新型层。

  • Two domain similarity-based frameworks for Domain Generalization [164, 163]. The frameworks rely on the idea that, given a set of different classification models associated with known domains (e.g. corresponding to multiple environments, robots), the best model for a new sample in the novel domain can be computed directly at test time by optimally combining the known models. While in [164] the combination is carried out through the statistics of batch-normalization layers [109], in [163] a similar principle is applied at the classification level.
  • 两种基于领域相似性的领域泛化框架 [164, 163]。这些框架基于这样的理念:给定一组与已知领域(例如,对应于多个环境、机器人)相关联的不同分类模型,对于新领域中的新样本,最佳模型可以在测试时通过对已知模型进行最优组合直接计算得出。在 [164] 中,这种组合是通过批量归一化层 [109] 的统计信息实现的,而在 [163] 中,类似的原理应用于分类层面。

  • A simple yet effective algorithm for Continuous DA in Robotics [166]. The algorithm is based on an online update of standard batch-normalization layers. We show the effectiveness of our algorithm on a newly collected dataset with challenging robotic scenarios, containing various illumination conditions, backgrounds, and viewpoints.
  • 一种简单而有效的机器人连续领域自适应(Continuous DA)算法 [166]。该算法基于标准批量归一化层的在线更新。我们在一个新收集的具有挑战性的机器人场景数据集上展示了该算法的有效性,该数据集包含各种光照条件、背景和视角。

  • The first deep learning model that can tackle Predictive DA [165]. In this scenario, no target data are available and the system has to learn to generalize from annotated source images plus unlabeled samples with associated metadata from auxiliary domains. We inject metadata information within a deep architecture by encoding the relation between different domains through a graph. Given the target domain metadata, our approach produces the target model by a weighted combination of the domain-specific parameters associated with the graph nodes. We also propose to refine the predicted target model through the incoming stream of target data directly at test time, extending [166].
  • 第一个能够处理预测性领域自适应(Predictive DA)的深度学习模型 [165]。在这种场景下,没有目标数据可用,系统必须学会从带注释的源图像以及来自辅助领域的带有相关元数据的未标记样本中进行泛化。我们通过图来编码不同领域之间的关系,将元数据信息注入到深度架构中。给定目标领域的元数据,我们的方法通过与图节点相关联的特定领域参数的加权组合来生成目标模型。我们还提议在测试时直接通过传入的目标数据流来细化预测的目标模型,对 [166] 进行扩展。
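To make the first contribution above concrete, the weighted batch normalization at its core can be sketched as follows. This is a simplification of the layers in [169, 168, 170] (scalar features, no learnable affine parameters), where each sample's soft domain weights come from the domain prediction branch:

```python
def weighted_bn(batch, weights, eps=1e-5):
    # batch: N scalar activations; weights: N x D soft domain assignments.
    n_domains = len(weights[0])
    stats = []
    # Per-latent-domain weighted mean and variance.
    for k in range(n_domains):
        wsum = sum(w[k] for w in weights)
        mu = sum(w[k] * x for w, x in zip(weights, batch)) / wsum
        var = sum(w[k] * (x - mu) ** 2 for w, x in zip(weights, batch)) / wsum
        stats.append((mu, var))
    # Each sample is normalized by a mixture of its domains' statistics.
    return [sum(w[k] * (x - mu) / (var + eps) ** 0.5
                for k, (mu, var) in enumerate(stats))
            for x, w in zip(batch, weights)]
```

With one-hot weights this reduces to independent per-domain batch normalization; soft weights normalize each sample with a mixture of domain statistics.
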
Modeling the Semantic Shift. In the context of adding new semantic concepts to a pre-trained architecture, we will present:
对语义偏移进行建模 在将新语义概念纳入预训练架构的背景下,我们将展示:

  • An effective algorithm performing multi-domain learning [171, 172]. The algorithm builds on previous works by masking the weights of a pre-trained architecture through task/domain-specific binary filters [160]. However, we take into account more elaborate affine transformations of the binary masks, showing that our generalization achieves significantly higher levels of adaptation to new tasks, with performance comparable to fine-tuning strategies while requiring slightly more than 1 bit per network parameter per additional task. With this strategy, we achieve results close to the state of the art in the Visual Domain Decathlon challenge [214].
  • 一种执行多领域学习的有效算法 [171, 172]。该算法基于先前的工作,通过特定任务/领域的二进制滤波器 [160] 对预训练架构的权重进行掩码。然而,我们考虑了二进制掩码更精细的仿射变换,表明我们的泛化方法在适应新任务方面达到了显著更高的水平,其性能与微调策略相当,而每个额外任务每个网络参数仅需略多于 1 位。通过这种策略,我们在视觉领域十项全能挑战 [214] 中取得了接近当前最优水平的结果。

  • An incremental class learning algorithm for semantic segmentation which explicitly models the background semantic shift problem [31]. In particular, we identify and analyze the problem of semantic shift of the background class in incremental learning for semantic segmentation. This problem arises since the background class might contain both old as well as still unseen classes. This exacerbates the catastrophic forgetting problem and hampers the ability to learn novel concepts. To tackle this issue, we propose a new distillation-based algorithm with an objective function and a classifier initialization strategy that explicitly model the semantic shift of the background class. The proposed algorithm largely outperforms standard incremental learning methods on different benchmarks.
  • 一种用于语义分割的增量类学习算法,该算法明确对背景语义偏移问题进行建模 [31]。具体而言,我们识别并分析了语义分割增量学习中背景类的语义偏移问题。这个问题的出现是因为背景类可能既包含旧类也包含尚未见过的类。这加剧了灾难性遗忘问题,并阻碍了学习新的概念的能力。为了解决这个问题,我们提出了一种基于蒸馏的新算法,该算法具有一个目标函数和一个分类器初始化策略,能够明确对背景类的语义偏移进行建模。所提出的算法在不同的基准测试中大大优于标准的增量学习方法。

  • The first deep architecture able to perform open-world recognition (OWR) [167]. The proposed deep network is based on a deep extension of a nonparametric model [15]: it can detect whether a perceived object belongs to the set of categories known by the system, and it can learn new categories without the need to retrain the whole system from scratch. In a first study [167], we considered both the case where annotated images of the new category are provided by an 'oracle' (i.e. human supervision) and the case where they are obtained by autonomous mining of the Web. In a second instance [69], we show how clustering-based techniques can boost the performance of this OWR framework.
  • 第一个能够执行开放世界识别(Open-World Recognition,OWR)的深度架构 [167]。所提出的深度网络基于非参数模型的深度扩展 [15],它可以检测感知到的对象是否属于系统已知的类别集合,并且无需从头重新训练整个系统即可进行学习。在第一项研究 [167] 中,我们考虑了两种情况:一种是“神谕”(即人工监督)可以提供关于新类别的带注释图像,另一种是通过自主挖掘网络来获取。在第二项研究 [69] 中,我们展示了基于聚类的技术如何提升这个 OWR 框架的性能。
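The rejection mechanism at the heart of this nonparametric family can be sketched as a nearest-class-mean classifier with a threshold. The snippet below is our simplified, distance-based caricature, not the exact DeepNNO decision rule: a sample is assigned to the closest known class mean, or rejected as unknown when even the closest mean is too far.

```python
def ncm_predict(x, class_means, threshold):
    # Nearest-class-mean with rejection: return the closest known class,
    # or None ("unknown") when the distance exceeds the threshold.
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    label, best = min(((c, dist(x, m)) for c, m in class_means.items()),
                      key=lambda t: t[1])
    return label if best <= threshold else None
```

Adding a new category then only requires storing one more class mean, which is why such models need no retraining from scratch.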

Modeling the Semantic and Domain Shift together. In the context of merging the two worlds, we will describe the new ZSL+DG problem [162] where, at test time, images of unseen domains as well as unseen classes must be correctly classified. Additionally, we will present the first holistic method capable of addressing ZSL and DG individually and both combined (ZSL+DG). Our method is based on simulating new domains and categories during training by mixing the available training domains and classes both at the image and feature levels. The mixing strategy becomes increasingly more challenging during training, in a curriculum fashion. The extensive experimental analysis shows the effectiveness of our approach in all settings: ZSL, DG, and ZSL+DG.
同时对语义和领域偏移进行建模。在融合这两个领域的背景下,我们将描述新的零样本学习与领域泛化(Zero-Shot Learning + Domain Generalization,ZSL+DG)问题 [162],在测试时,必须正确分类未见领域以及未见类别的图像。此外,我们将提出第一种能够分别处理零样本学习(ZSL)和领域泛化(DG)以及同时处理两者(ZSL+DG)的整体方法。我们的方法基于在训练期间通过在图像和特征层面混合可用的训练领域和类别来模拟新的领域和类别。在训练过程中,这种混合策略以课程式的方式变得越来越具有挑战性。广泛的实验分析表明,我们的方法在所有场景下都有效:ZSL、DG 和 ZSL+DG。

1.3 Outline
1.3 大纲


Chapter 2 will discuss the domain shift problem. It will first give an overview of the problem (Section 2.1) and the related works (Section 2.2), delving into the details of the Domain Alignment Layers of [29, 28], which serve as a starting point for our works. In Section 2.4, we will describe our multi-domain Alignment Layers, which allow us to model multiple but mixed source domains through weighted normalization and a domain classifier for unsupervised domain adaptation. In Sections 2.5, 2.6 and 2.7, we will consider the case where no target data are available. In particular, in Section 2.5 we will extend the multi-domain Alignment Layers to the domain generalization scenario and show how the domain classifier can be used as a proxy to merge activations from layers beyond normalization ones for effective DG. In Section 2.6, we present ONDA, a continuous DA approach which continuously updates normalization statistics as target data arrive. Finally, in Section 2.7, we present AdaGraph, the first deep learning-based approach for predictive domain adaptation, which merges normalization statistics of different layers based on the given vectorized description of the target domain.
第 2 章将讨论领域偏移问题。它将首先概述这些问题(2.1 节)和相关工作(2.2 节),深入探讨 [29,28] 的领域对齐层的细节,这些层是我们工作的起点。在 2.4 节中,我们将描述我们的多领域对齐层,它允许我们通过加权归一化和用于无监督领域自适应的领域分类器对多个但混合的源领域进行建模。在 2.5、2.6 和 2.7 节中,我们将考虑没有目标数据可用的情况。特别是,在 2.5 节中,我们将把多领域对齐层扩展到领域泛化场景,并展示如何将领域分类器用作代理,以合并归一化层之外的层的激活,从而实现有效的领域泛化。在 2.6 节中,我们提出 ONDA,这是一种连续领域自适应(Continuous Domain Adaptation,DA)方法,它在目标数据到达时利用归一化统计信息的连续更新。最后,在 2.7 节中,我们提出 AdaGraph,这是第一种基于深度学习的预测性领域自适应方法,它根据给定的目标领域的向量化描述合并不同层的归一化统计信息。

Chapter 3 will lead us to the semantic shift problem. It will start by presenting a general problem definition (Section 3.1) with an overview of the related works (Section 3.2). It will then describe BAT (Section 3.3), an approach for multi-domain learning where task-specific binary masks are affinely transformed to obtain a good trade-off between performance and parameters. In Section 3.4, we identify the background-shift problem in incremental class learning for semantic segmentation and we describe MiB, the first method addressing it, which changes how background probabilities are treated in standard entropy losses. Finally, in Section 3.5, we will describe DeepNNO, the first deep approach for Open World Recognition, and how we can improve this model with clustering and learned rejection thresholds.
第 3 章将引导我们探讨语义偏移问题。它将首先给出一个通用的问题定义(3.1 节),并概述相关工作(3.2 节)。然后将描述 BAT(3.3 节),这是一种多领域学习方法,其中特定任务的二进制掩码经过仿射变换,以在性能和参数之间取得良好的平衡。在 3.4 节中,我们确定了语义分割增量类学习中的背景偏移问题,并描述 MiB,这是第一种解决该问题的方法,通过改变标准熵损失中背景概率的处理方式。最后,在 3.5 节中,我们将描述 DeepNNO,这是第一种用于开放世界识别的深度方法,以及我们如何通过聚类和学习到的拒绝阈值来改进这个模型。

Chapter 4 will discuss the importance of tackling both domain and semantic shift together (Section 4.1) and the works that pushed towards this direction (Section 4.2). We will then present a new task, zero-shot learning under domain generalization and a first holistic method, CuMix, addressing domain and semantic shift together, using increasingly more complex mixing of samples and features.
第 4 章将讨论同时解决领域和语义偏移的重要性(4.1 节)以及推动这一方向的相关工作(4.2 节)。然后我们将提出一个新任务,即领域泛化下的零样本学习,以及第一种整体方法 CuMix,它通过越来越复杂的样本和特征混合来同时解决领域和语义偏移问题。

The thesis concludes by summarizing the findings, open problems, and possible future direction of research in Chapter 5.
论文在第 5 章通过总结研究结果、未解决的问题以及可能的未来研究方向来结束。

1.4 Publications
1.4 发表成果


In the following, the author's publications are listed in chronological order. Note that some articles (marked with *) have not been included in the thesis.
以下按时间顺序列出作者的发表成果。注意,一些文章(标有 *)未包含在本论文中。

  • * M. Mancini, S. Rota Bulò, E. Ricci, B. Caputo

Learning Deep NBNN Representations for Robust Place Categorization
学习用于鲁棒场所分类的深度朴素贝叶斯最近邻(Naive Bayes Nearest Neighbor,NBNN)表示

IEEE Robotics and Automation Letters, May 2017, vol. 3, n. 2., pp. 1794-1801.
《IEEE 机器人与自动化快报》,2017 年 5 月,第 3 卷,第 2 期,第 1794 - 1801 页。

Presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2017.
在 2017 年 IEEE/RSJ 国际智能机器人与系统会议(IEEE/RSJ International Conference on Intelligent Robots and Systems,IROS)上发表。

  • M. Mancini, L. Porzi, S. Rota Bulò, B. Caputo, E. Ricci

Boosting Domain Adaptation by Discovering Latent Domains
通过发现潜在领域提升领域自适应能力

IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2018. (spotlight)
电气与电子工程师协会国际计算机视觉与模式识别会议(CVPR)2018。(亮点论文)

  • M. Mancini, S. Rota Bulò, B. Caputo, E. Ricci
  • M. 曼奇尼、S. 罗塔·布洛、B. 卡普托、E. 里奇

Robust Place Categorization with Deep Domain Generalization
基于深度领域泛化的鲁棒场所分类

IEEE Robotics and Automation Letters, July 2018, vol. 3, n. 3., pp. 2093-2100.
电气与电子工程师协会机器人与自动化快报,2018年7月,第3卷,第3期,第2093 - 2100页。

  • M. Mancini, E.Ricci, B. Caputo, S. Rota Bulò
  • M. 曼奇尼、E. 里奇、B. 卡普托、S. 罗塔·布洛

Adding New Tasks to a Single Network with Weight Transformations using Binary Masks
使用二进制掩码通过权重变换为单个网络添加新任务

European Conference on Computer Vision (ECCV) Workshop on Transferring and Adapting Source Knowledge in Computer Vision 2018. (best paper award honorable mention)
欧洲计算机视觉会议(ECCV)计算机视觉中源知识迁移与适应研讨会 2018。(最佳论文奖荣誉提名)

  • M. Mancini, S. Rota Bulò, B. Caputo, E. Ricci
  • M. 曼奇尼、S. 罗塔·布洛、B. 卡普托、E. 里奇

Best sources forward: domain generalization through source-specific nets
最佳源向前传播:通过特定源网络实现领域泛化

IEEE International Conference on Image Processing (ICIP) 2018.
电气与电子工程师协会国际图像处理会议(ICIP) 2018。

  • M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, B. Caputo
  • M. 曼奇尼、H. 卡拉奥古兹、E. 里奇、P. 延斯费尔特、B. 卡普托

Kitting in the Wild through Online Domain Adaptation
通过在线领域自适应实现野外套件装配

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018.
电气与电子工程师协会/日本机器人协会智能机器人与系统国际会议(IROS)2018。

  • M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, B. Caputo
  • M. 曼奇尼、H. 卡拉奥古兹、E. 里奇、P. 延斯费尔特、B. 卡普托

Knowledge is Never Enough: Towards Web Aided Deep Open World Recognition
知识永无止境:迈向网络辅助的深度开放世界识别

IEEE International Conference on Robotics and Automation (ICRA) 2019.
电气与电子工程师协会国际机器人与自动化会议(ICRA)2019 年。

  • M. Mancini, S. Rota Bulò, B. Caputo, E. Ricci
  • M. 曼奇尼、S. 罗塔·布洛、B. 卡普托、E. 里奇

AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs
AdaGraph:通过图统一预测性和连续性领域自适应

IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR) 2019. (oral)
电气与电子工程师协会/计算机视觉基金会国际计算机视觉与模式识别会议(CVPR)2019 年。(口头报告)

  • * M. Mancini, L. Porzi, F. Cermelli, B. Caputo
  • * M. 曼奇尼、L. 波尔齐、F. 切尔梅利、B. 卡普托

Discovering Latent Domains for Unsupervised Domain Adaptation through Consistency
通过一致性发现无监督领域自适应的潜在领域

International Conference on Image Analysis and Processing (ICIAP) 2019.
国际图像分析与处理会议(ICIAP)2019 年。

  • * F. Cermelli, M. Mancini, E. Ricci, B. Caputo The RGB-D Triathlon: Towards Agile Visual Toolboxes for Robots IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019.
  • * F. 切尔梅利、M. 曼奇尼、E. 里奇、B. 卡普托 RGB - D 三项全能:迈向机器人的敏捷视觉工具箱 电气与电子工程师协会/日本机器人协会智能机器人与系统国际会议(IROS)2019 年。

  • M. Mancini, L. Porzi, S. Rota Bulò, B. Caputo, E. Ricci Inferring Latent Domains for Unsupervised Deep Domain Adaptation IEEE Transactions on Pattern Analysis & Machine Intelligence 2019.
  • M. 曼奇尼(M. Mancini)、L. 波尔齐(L. Porzi)、S. 罗塔·布卢(S. Rota Bulò)、B. 卡普托(B. Caputo)、E. 里奇(E. Ricci) 无监督深度域适应的潜在域推断 《IEEE 模式分析与机器智能汇刊》2019 年。

  • * L. O. Vasconcelos, M. Mancini, D. Boscaini, B. Caputo, E. Ricci Structured Domain Adaptation for 3D Keypoint Estimation International Conference on 3D Vision (3DV) 2019. (oral)
  • * L. O. 瓦斯康塞洛斯(L. O. Vasconcelos)、M. 曼奇尼(M. Mancini)、D. 博斯卡伊尼(D. Boscaini)、B. 卡普托(B. Caputo)、E. 里奇(E. Ricci) 3D 关键点估计的结构化域适应 国际 3D 视觉会议(3DV)2019 年。(口头报告)

  • F. Cermelli, M. Mancini, E. Ricci, B. Caputo
  • F. 切尔梅利(F. Cermelli)、M. 曼奇尼(M. Mancini)、E. 里奇(E. Ricci)、B. 卡普托(B. Caputo)

Modeling the Background for Incremental Learning in Semantic Segmentation IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
语义分割增量学习中的背景建模 《IEEE/CVF 国际计算机视觉与模式识别会议》(CVPR)2020 年。

  • M. Mancini, E.Ricci, B. Caputo, S. Rota Bulò
  • M. 曼奇尼(M. Mancini)、E. 里奇(E. Ricci)、B. 卡普托(B. Caputo)、S. 罗塔·布卢(S. Rota Bulò)

Boosting Binary Masks for Multi-Domain Learning through Affine Transformations.
通过仿射变换增强多域学习的二值掩码。

Machine Vision and Applications, June 2020, vol. 31, n. 6, pp. 1-14.
《机器视觉与应用》,2020 年 6 月,第 31 卷,第 6 期,第 1 - 14 页。

  • D. Fontanel, F. Cermelli, M. Mancini, S. Rota Buló, E. Ricci, B. Caputo Boosting Deep Open World Recognition by Clustering IEEE Robotics and Automation Letters, October 2020, vol. 5, no. 4, pp. 5985-5992. Presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2020.
  • D. 丰塔内尔(D. Fontanel)、F. 切尔梅利(F. Cermelli)、M. 曼奇尼(M. Mancini)、S. 罗塔·布卢(S. Rota Bulò)、E. 里奇(E. Ricci)、B. 卡普托(B. Caputo) 通过聚类增强深度开放世界识别 《IEEE 机器人与自动化快报》,2020 年 10 月,第 5 卷,第 4 期,第 5985 - 5992 页。在 2020 年 IEEE/RSJ 国际智能机器人与系统会议(IROS)上发表。

  • M. Mancini, Z. Akata, E. Ricci, B. Caputo
  • M. 曼奇尼(M. Mancini)、Z. 阿卡塔(Z. Akata)、E. 里奇(E. Ricci)、B. 卡普托(B. Caputo)

Towards Recognizing Unseen Categories in Unseen Domains. European Conference on Computer Vision (ECCV) 2020.
迈向在未见域中识别未见类别。欧洲计算机视觉会议(ECCV)2020 年。

  • * L. O. Vasconcelos, M. Mancini, D. Boscaini, S. Rota Buló, B. Caputo, E. Ricci
  • * L. O. 瓦斯康塞洛斯(L. O. Vasconcelos)、M. 曼奇尼(M. Mancini)、D. 博斯卡伊尼(D. Boscaini)、S. 罗塔·布卢(S. Rota Bulò)、B. 卡普托(B. Caputo)、E. 里奇(E. Ricci)

Shape Consistent 2D Keypoint estimation under Unsupervised Domain Adaptation.
无监督域适应下的形状一致 2D 关键点估计。

International Conference on Pattern Recognition (ICPR) 2020.
国际模式识别会议(ICPR)2020 年。

Chapter 2 Recognition across New Visual Domains
第 2 章 跨新视觉域的识别


This chapter presents various strategies to tackle the domain shift problem in the presence of different information regarding the source and target domains. We start by providing a general formulation of the problem (Sec. 2.1). We then review related literature (Sec. 2.2), analyzing the Domain Alignment layers for DA (Sec. 2.3), introduced in previous works [29, 28, 142]. In the remaining sections, we describe how we extended the Domain Alignment layers to address non-canonical DA settings. We start with the latent-domain discovery problem (Sec. 2.4), where we have multiple source/target domains but mixed, i.e. we do not know to which domain each sample belongs. We describe the first deep learning solution to this problem [169, 168], based on a weighted computation of the batch-normalization statistics [109] both at training time (in case of mixed source domains) and at inference time (in case of mixed targets). In Sec. 2.5, we show how a similar approach can be applied to tackle the domain generalization problem [164], removing the assumption of having target data at training time. Additionally, we show how to extend the same idea beyond batch-normalization layers, mixing activations of domain-specific classification modules [163]. In Sec. 2.6, we take a step further, removing the assumption of having multiple source domains during training, and develop a model able to adapt to arbitrary target domains at inference time, dynamically updating its internal knowledge in a continuous fashion [166]. Finally, in Sec. 2.7, we provide a solution to the Predictive DA scenario, where we must use multiple auxiliary domains with associated metadata during training to learn the relationship between metadata and domains. We then exploit this knowledge to generate a model for the target domain given just its description in terms of metadata.
Our solution, called AdaGraph [165], is based on multiple domain-specific batch-normalization layers connected through a graph that we use at inference time to produce a model for the target domain. AdaGraph is the first deep learning-based approach to tackle the Predictive DA problem. In [165], we also extend the continuous DA approach in [166] to dynamically refine the predicted models at test time.
本章介绍了在存在关于源域和目标域的不同信息时,解决域偏移问题的各种策略。我们首先对该问题进行一般性表述(2.1 节)。然后回顾相关文献(2.2 节),分析先前工作 [29,28,142] 中引入的用于域适应(DA)的域对齐层(2.3 节)。在其余章节中,我们描述了如何扩展域对齐层以解决非规范的 DA 设置。我们从潜在域发现问题开始(2.4 节),在该问题中,我们有多个源/目标域,但它们是混合的,即我们不知道每个样本属于哪个域。我们描述了基于批量归一化统计量 [109] 的加权计算,针对该问题的第一个深度学习解决方案 [169, 168],无论是在训练时(对于混合源域的情况)还是在推理时(对于混合目标的情况)。在 2.5 节中,我们展示了如何应用类似的方法来解决域泛化问题 [164],去除在训练时需要目标数据的假设。此外,我们展示了如何将相同的想法扩展到批量归一化层之外,混合特定域分类模块的激活 [163]。在 2.6 节中,我们更进一步,去除在训练期间需要多个源域的假设,开发一个能够在推理时适应任意目标域的模型,以连续的方式动态更新其内部知识 [166]。最后,在 2.7 节中,我们为预测性 DA 场景提供了一个解决方案,在该场景中,我们必须在训练期间使用多个带有相关元数据的辅助域来学习元数据和域之间的关系。然后,我们利用这些知识,仅根据目标域的元数据描述为其生成一个模型。我们的解决方案称为 AdaGraph [165],它基于通过图连接的多个特定域批量归一化层,我们在推理时使用该图为目标域生成一个模型。AdaGraph 是第一个基于深度学习的方法来解决预测性 DA 问题。在 [165] 中,我们还将 [166] 中的连续 DA 方法扩展到在测试时动态细化预测模型。
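A toy version of the AdaGraph prediction step might look as follows. Each graph node stores domain metadata and domain-specific parameters (e.g. BN statistics); the target parameters are obtained by combining node parameters, weighting each node by the similarity of its metadata to the target's metadata. The Gaussian kernel and the function names here are our own assumptions for illustration, not the exact edge weighting of [165].

```python
import math

def predict_target_params(nodes, target_meta, sigma=1.0):
    # nodes: list of (metadata_vector, parameter_vector) pairs, one per
    # graph node; target_meta: metadata vector of the unseen target.
    def kernel(a, b):
        d2 = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-d2 / (2 * sigma ** 2))
    ws = [kernel(meta, target_meta) for meta, _ in nodes]
    z = sum(ws)
    dim = len(nodes[0][1])
    # Normalized similarity-weighted combination of node parameters.
    return [sum(w * params[i] for w, (_, params) in zip(ws, nodes)) / z
            for i in range(dim)]
```

A target whose metadata sits between two known domains receives intermediate parameters, while one close to a single node essentially inherits that node's parameters.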

2.1 Problem statement
2.1 问题陈述


As described in Section 1.1.1, the goal of DA algorithms is to transfer knowledge from a large labeled dataset, i.e. the source domain, to a small and/or unlabeled one, i.e. the target. In particular, throughout this work, we will focus on the case where the target domain is either fully unsupervised or not present at all during training.
如 1.1.1 节所述,DA 算法的目标是将知识从一个大的有标签数据集(即源域)转移到一个小的和/或无标签的数据集(即目标域)。特别是,在整个工作中,我们将重点关注目标域在训练期间要么完全无监督要么根本不存在的情况。

The first case is the Unsupervised Domain Adaptation problem (UDA). Formally, we can define the UDA problem as follows. Let us denote with $\mathcal{X}$ our input space (e.g. the image space), with $\mathcal{Y}$ our output space (e.g. the set of possible semantic classes) and with $\mathcal{D}$ the set of possible visual domains (e.g. environments, illumination conditions). Denoting with $\mathcal{D}^s \subseteq \mathcal{D}$ the set of our source domain(s), we can define our supervised training set as $S=\{(x^s_i, y^s_i, s_i)\}_{i=1}^{n}$, where $x^s_i \in \mathcal{X}$, $y^s_i \in \mathcal{Y}$ and $s_i \in \mathcal{D}^s$. Moreover, let us define our unsupervised target dataset as $T=\{(x^t_j, t_j)\}_{j=1}^{m}$, with $x^t_j \in \mathcal{X}$ and $t_j \in \mathcal{D}^t \subseteq \mathcal{D}$. Note that we assume source and target domains to differ, i.e. $\mathcal{D}^t \neq \mathcal{D}^s$. Moreover, due to the domain shift, each domain has a different joint distribution defined over $\mathcal{X} \times \mathcal{Y}$: we have $p(x, y \mid d_i) \neq p(x, y \mid d_j)$ for $d_i, d_j \in \mathcal{D}^s \cup \mathcal{D}^t$ with $d_i \neq d_j$. Our goal is to learn a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$ which is effective for each of our target domain(s) $\mathcal{D}^t$.
第一种情况是无监督领域自适应问题(UDA)。形式上,我们可以将UDA问题定义如下。我们用X表示输入空间(例如图像空间),用Y表示输出空间(例如可能的语义类别集合),用D表示可能的视觉领域集合(例如环境、光照条件)。用DsD表示源领域集合,我们可以将有监督训练集定义为S={(xis,yis,si)}i=1n,其中xisX,yisYsiDs。此外,我们将无监督目标数据集定义为T={(xjt,tj)}j=1m,其中xjtXtjDtD。注意,我们假设源领域和目标领域不同,即DtDs。此外,由于领域偏移,每个领域在X×Y上定义了不同的联合分布:我们有p(x,ydi)p(x,ydj),其中diDsDt,djDsDtdidj。我们的目标是学习一个映射f:XY,该映射对每个目标领域Dt都有效。

From our formulation, we have the standard single-source/target scenario when $|\mathcal{D}^s| = |\mathcal{D}^t| = 1$, while the multi-source scenario arises when $|\mathcal{D}^s| > 1$. In both cases, $T$ is assumed to be available during training. In case both $S$ and $T$ are available but at least one of them is composed of an unknown mixture of domains (i.e. $|\mathcal{D}^s| = k_s \geq 1$, $|\mathcal{D}^t| = k_t \geq 1$ with unknown $k_s$ and/or $k_t$), we are in the latent domain discovery scenario and we have no domain identifier $d$ in the tuples of $S$ and $T$.
根据我们的公式,当|Ds|=|Dt|=1时,我们得到标准的单源/目标场景;当|Ds|>1时,得到多源场景。在这两种情况下,假设在训练期间T是可用的。如果ST都可用,但其中至少有一个由未知的领域混合组成(即|Ds|=ks1,|Dt|=kt1,其中ks未知和/或kt未知),我们处于潜在领域发现场景,并且在ST的三元组中没有领域标识符d

In case $T$ is not available during training but $|\mathcal{D}^s| > 1$, we are in the Domain Generalization (DG) scenario. In this setting, we can exploit the presence of multiple source domains, even latent ones, to disentangle domain-specific and semantic-specific components from our inputs, producing a model robust to any possible target domain.
如果在训练期间T不可用,但|Ds|>1可用,我们处于领域泛化(DG)场景。在这种设置下,我们可以利用多个源领域(即使是潜在的)的存在,从我们的输入中分离出领域特定和语义特定的组件,从而产生一个对任何可能的目标领域都具有鲁棒性的模型。

If $T$ is not available during training and $|\mathcal{D}^s| = 1$, we cannot disentangle domain-specific and semantic-specific information. However, we can still cope with the domain shift problem in different ways, depending on the information we have about our target. If no information is available, we can only adapt our model at test time, while classifying samples of the target domain. This is known as the Continuous/Online DA scenario.
在训练期间无法获取T,并且在|Ds|=1时,我们无法分离特定领域和特定语义的信息。然而,我们仍然可以根据对目标的了解,以不同的方式应对领域偏移问题。如果没有可用信息,我们只能在测试时调整模型,同时对目标领域的样本进行分类。这被称为连续/在线领域自适应(Continuous/Online DA)场景。
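The core of such a test-time adaptation scheme is remarkably simple: as unlabeled target batches arrive, the batch-normalization statistics are updated online so that they drift from the source towards the target distribution. A minimal sketch (scalar feature, exponential-moving-average update; a simplification of the per-batch rule in [166]):

```python
def onda_update(running, batch, momentum=0.1):
    # Online update of BN statistics at test time: the running
    # mean/variance gradually track the incoming target distribution.
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    r_mu, r_var = running
    return ((1 - momentum) * r_mu + momentum * mu,
            (1 - momentum) * r_var + momentum * var)
```

No labels and no gradient steps are needed, which is what makes this strategy viable on a robot processing a continuous stream of target images.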

Lastly, another scenario is Predictive DA (PDA). In this case, we have a set of auxiliary domains $\mathcal{D}^a$ forming an additional training dataset $A=\{(x^a_i, d^a_i)\}_{i=1}^{r}$. Moreover, the domain identifiers $d \in \mathcal{D}^s \cup \mathcal{D}^a$ are expressed as metadata. Using the auxiliary set $A$ and the domain metadata, we can learn a mapping between metadata and domain-specific parameters. Then, given the target metadata $d^t$, we can infer its domain-specific parameters, reducing the domain shift problem.
最后,另一种场景是预测性领域自适应(Predictive DA,PDA)。在这种情况下,我们有一组辅助领域Da,形成了一个额外的训练数据集A={(xia,dia)}i=1r。此外,领域标识符dDsDa以元数据的形式表示。利用辅助集A和领域元数据,我们可以学习元数据和特定领域参数之间的映射。然后,给定目标元数据dt,我们可以推断其特定领域参数,从而减少领域偏移问题。

In the following section, we will review the relevant literature for DA and each of the previously mentioned problems. As a final remark, it is worth highlighting that, in this chapter, we assume source and target domains share the same output space $\mathcal{Y}$. In Chapter 3 we will consider the case where the visual domains are shared between train and test data (i.e. $\mathcal{D}^s = \mathcal{D}^t$) but the semantic classes differ and/or vary over time. Finally, in Chapter 4 we will consider the scenario where both the output and the domain space differ between train and test.
在接下来的部分,我们将回顾领域自适应(DA)以及前面提到的每个问题的相关文献。最后值得强调的是,在本章中,我们假设源领域和目标领域共享相同的输出空间Y。在第3章中,我们将考虑训练数据和测试数据共享视觉领域(即Ds=Dt)但语义类别不同和/或随时间变化的情况。最后,在第4章中,我们将考虑训练和测试之间输出和领域空间都不同的场景。

2.2 Related Works
2.2 相关工作


In this section we will review previous works on DA. We start by reviewing DA methods, based on both hand-crafted and deep features, in standard scenarios where target domain data are available. We then review previous works tackling the domain shift problem without target domain data, starting from DG techniques and covering less explored directions, such as Continuous and Predictive DA.
在本节中,我们将回顾之前关于领域自适应(DA)的工作。我们首先回顾在目标领域数据可用的标准场景下,基于手工特征和深度特征的领域自适应方法。然后,我们回顾在没有目标领域数据的情况下解决领域偏移问题的先前工作,从领域泛化(DG)技术开始,涵盖较少探索的方向,如连续领域自适应和预测性领域自适应。

DA methods with hand-crafted features. Earlier DA approaches operate on hand-crafted features and attempt to reduce the discrepancy between the source and the target domains by adopting different strategies. For instance, instance-based methods [108, 289, 84] develop from the idea of learning classification/regression models by re-weighting source samples according to their similarity with the target data. A different strategy is exploited by feature-based methods, which cope with domain shift by learning a common subspace for source and target data, such as to obtain domain-invariant representations [86, 153, 67]. Parameter-based methods [291] address the domain shift problem by discovering a set of shared weights between the source and the target models. However, they usually require labeled target data, which are not always available.
基于手工特征的领域自适应方法。早期的领域自适应方法基于手工特征进行操作,并试图通过采用不同的策略来减少源领域和目标领域之间的差异。例如,基于实例的方法[108,289,84]源于根据源样本与目标数据的相似度对其重新加权来学习分类/回归模型的思想。基于特征的方法采用了不同的策略,通过为源数据和目标数据学习一个公共子空间来应对领域偏移,从而获得领域不变的表示[86,153,67]。基于参数的方法[291]通过发现源模型和目标模型之间的一组共享权重来解决领域偏移问题。然而,它们通常需要有标签的目标数据,而这些数据并不总是可用的。

While most earlier DA approaches focus on a single-source and single-target setting, some works have considered the related problem of learning classification models when the training data spans multiple domains [174, 60, 252]. The common idea behind these methods is that when source data arises from multiple distributions, adopting a single source classifier is suboptimal and improved performance can be obtained by leveraging information about multiple domains. However, these methods assume that the domain labels for all source samples are known in advance. In practice, in many applications the information about domains is hidden and latent domains must be discovered within the large training set. Few works have considered this problem in the literature. Hoffman et al. [104] address this task by modeling domains as Gaussian distributions in the feature space and by estimating the membership of each training sample to a source domain using an iterative approach. Gong et al. [85] discover latent domains by devising a nonparametric approach which aims at simultaneously achieving maximum distinctiveness among domains and ensuring that strong discriminative models are learned for each latent domain. In [283], domains are modeled as manifolds and source image representations are learned by decoupling information about semantic category and domain. By exploiting these representations, the domain assignment labels are inferred using a mutual information based clustering method.
虽然大多数早期的领域自适应方法专注于单源单目标设置,但一些工作考虑了训练数据跨越多个领域时学习分类模型的相关问题[174, 60, 252]。这些方法背后的共同思想是,当源数据来自多个分布时,采用单一的源分类器是次优的,通过利用多个领域的信息可以提高性能。然而,这些方法假设所有源样本的领域标签是预先已知的。实际上,在许多应用中,关于领域的信息是隐藏的,必须在大型训练集中发现潜在领域。文献中很少有工作考虑这个问题。霍夫曼(Hoffman)等人[104]通过将领域建模为特征空间中的高斯分布,并使用迭代方法估计每个训练样本对源领域的隶属度来解决这个任务。龚(Gong)等人[85]通过设计一种非参数方法来发现潜在领域,该方法旨在同时实现领域之间的最大区分度,并确保为每个潜在领域学习到强大的判别模型。在[283]中,领域被建模为流形,并且通过分离语义类别和领域的信息来学习源图像表示。通过利用这些表示,使用基于互信息的聚类方法推断领域分配标签。

Deep Domain Adaptation. Most recent works on DA consider deep architectures, and robust domain-invariant features are learned using either supervised neural networks [154, 260, 77, 80, 24, 28], deep autoencoders [299] or generative adversarial networks [22, 241]. Research efforts can be grouped in terms of the number of source domains available at training time.
深度领域自适应。最近关于领域自适应(DA)的研究大多采用深度架构,通过有监督神经网络 [154,260,77,80,24,28]、深度自编码器 [299] 或生成对抗网络 [22,241] 来学习鲁棒的领域不变特征。研究工作可以根据训练时可用的源领域数量进行分类。

In the single-source DA setting, we can identify two main strategies. The first deals with features and aims at learning deep domain-invariant representations. The idea is to introduce in the learning architecture different measures of domain distribution shift at a single or multiple levels [157, 251, 28, 29] and then train the network to minimize these measures while also reducing a task-specific loss, for instance for classification or detection. In this way, the network produces features invariant to the domain shift, but still discriminative for the task at hand. Besides distribution evaluations, other domain shift measures used similarly are the error in the target sample reconstruction [80], or various coherence metrics on the pseudo-labels assigned by the source models to the target data [237, 97, 229]. Finally, a different group of feature-based methods relies on adversarial loss functions [260, 78]. The method proposed in [232], which pushes the network to be unable to discriminate whether a sample comes from the source or from the target, is an interesting variant of [78], where the domain difference is still measured at the feature level but passing through an image reconstruction step. Besides integrating the domain discrimination objective into end-to-end classification networks, it has also been shown that two-step networks may have practical advantages [261, 7].
在单源领域自适应设置中,我们可以确定两种主要策略。第一种策略处理特征,旨在学习深度领域不变表示。其思路是在学习架构中引入单级或多级的领域分布偏移度量 [157,251,28,29],然后训练网络以最小化这些度量,同时减少特定任务的损失,例如分类或检测任务的损失。通过这种方式,网络生成的特征对领域偏移具有不变性,但仍对当前任务具有判别性。除了分布评估之外,其他类似使用的领域偏移度量包括目标样本重建误差 [80],或源模型为目标数据分配的伪标签上的各种一致性度量 [237,97,229]。最后,另一类基于特征的方法依赖于对抗损失函数 [260, 78]。文献 [232] 中提出的方法促使网络无法区分样本是来自源领域还是目标领域,这是文献 [78] 的一个有趣变体,其中领域差异仍然在特征层面进行度量,但要经过图像重建步骤。除了将领域判别目标集成到端到端分类网络中,研究还表明两步网络可能具有实际优势 [261, 7]。
The second popular deep adaptive strategy focuses on images. The adversarial logic that demonstrated its effectiveness for feature-based methods has also been extended to the goal of reducing the visual domain gap. Powerful GAN [88] methods have been exploited to generate new images, or perturb existing ones, so that they resemble the visual style of a certain domain, thus reducing the discrepancy at the pixel level [23, 241]. Most of the works based on image adaptation aim at generating either target-like source images or source-like target images, but it has recently been shown that integrating both transformation directions is highly beneficial [226].

In practical applications one may be offered more than one source domain. This has triggered the study of multi-source DA algorithms. The multi-source setting was initially studied from a theoretical point of view, focusing on theorems indicating how to optimally sub-select the data to be used in learning the source models [47], or proposing principled rules for combining the source-specific classifiers to obtain the ideal target class prediction [174]. Several other works followed this direction in the shallow learning framework. With shallow methods, the naïve model learned by collecting all the source data into a single domain without any adaptation usually showed low performance on the target. It has been noticed that this behavior changes when moving to deep learning, where the larger number of samples, as well as their variability, supports generalization and usually provides good results on the target. Only very recently have two methods presented multi-source deep learning approaches that improve over this reference. The approach proposed in [286] builds on [78] by replicating the adversarial domain discriminator branch for each available source. Moreover, these discriminators are also used to obtain a perplexity score that indicates how the multiple sources should be combined at test time, according to the rule in [174]. A similar multi-way adversarial strategy is used in [308], but this work comes with a theoretical support that frees it from the need of respecting a specific optimal source combination, and thus from the need of learning the source weights.

While recent deep DA methods significantly outperform approaches based on hand-crafted features, the vast majority of them only consider single-source, single-target settings. Moreover, almost all works presented in the literature so far assume direct access to multiple source domains, whereas in many practical applications such knowledge might not be directly available, or might be costly to obtain in terms of time and human annotators. To our knowledge, our works [169, 168] are the first to propose a deep architecture for discovering latent source domains and exploiting them to improve classification performance on target data.
Domain Generalization. As opposed to domain adaptation [48], where it is assumed that target data are available in the training phase, the key idea behind DG is to learn a domain-agnostic model that can be applied to any unseen target domain. Although less researched than domain adaptation, the need for DG algorithms has been recognized for quite some time in the literature [186].

Previous DG methods can be broadly grouped into four main categories. The first category comprises methods which attempt to learn domain-invariant feature representations by considering specific alignment losses, such as maximum mean discrepancy (MMD), adversarial losses or self-supervised losses. Notable approaches in this category are [186, 137, 27]. The second category of methods [133, 115] develops from the idea of creating deep architectures where both domain-agnostic and domain-specific parameters are learned on the source domains. After training, only the domain-agnostic part is retained and used for processing target data. The third category devises specific optimization strategies or training procedures in order to enhance the generalization ability of the source model to unseen target data. For instance, in [134] a meta-learning approach is proposed for DG. Differently, in [135] an episodic training procedure is presented to learn models robust to the domain shift. The fourth category comprises methods which introduce data and feature augmentation strategies to synthesize novel samples and improve the generalization capability of the learned model [238, 268, 267]. These strategies are mostly based either on adversarial training [238, 268] or data augmentation [267].

Beyond DG: Domain Adaptation without Target Data. DG assumes that multiple source domains are available, but in some applications this assumption might not hold. This calls for DA methods able to cope with the domain shift when i) only one source domain is available and ii) no target data are available in the training phase. Depending on the available information, these methods can work by exploiting e.g. the stream of incoming target samples, or side information describing possible future target domains. Note that, differently from DG, these methods produce models which are not robust to any possible target domain, but must be re-adapted if the target domain changes.

The first scenario is typically referred to as continuous [103] or online DA [166]. To address this problem, in [103] a manifold-based DA technique is employed to model an evolving target data distribution. In [139], Li et al. propose to sequentially update a low-rank exemplar SVM classifier as data of the target domain become available. In [129], the authors propose to extrapolate the target data dynamics within a reproducing kernel Hilbert space.

The second scenario corresponds to the problem of Predictive DA (PDA). PDA is introduced in [293], where a multivariate regression approach is described for learning a mapping between domain metadata and points on a Grassmannian manifold. Given this mapping and the metadata of the target domain, two different strategies are proposed to infer the target classifier. In Section 2.7, we show how it is possible to address this task with deep architectures, using batch-normalization layers [109].

Other closely related tasks are the problems of zero-shot domain adaptation and domain generalization. In zero-shot domain adaptation [205] the task is to learn a prediction model in the target domain under the assumption that task-relevant source-domain data and task-irrelevant dual-domain paired data are available. Domain generalization methods [186, 133, 62, 185] attempt to learn domain-agnostic classification models by exploiting labeled source samples from multiple domains, without having access to target data. Similarly to Predictive DA, in domain generalization multiple datasets are available during training. However, in PDA the data from the auxiliary source domains are not labeled.

2.3 Preliminaries: Domain Alignment Layers


Batch-normalization (BN) [109] is a common strategy used in deep architectures for stabilizing the optimization problem, making the gradients better behaved, and enabling a faster and more effective training [233, 20]. BN works by normalizing the input features to a fixed target distribution, i.e. a standard Gaussian. Recent works [142, 29, 28] have shown how BN layers can be used to perform domain adaptation in a traditional batch setting. In the following, we will denote BN layers with domain-specific statistics as Domain Alignment layers (DA-layers).

DA-layers [142, 29, 28] are motivated by the observation that, in general, activations within a neural network follow domain-dependent distributions. As a way to reduce domain shift, the activations are thus normalized in a domain-specific way, shifting them according to a parameterized transformation in order to match their first and second-order moments to those of a reference distribution, which is generally chosen to be normal with zero mean and unit standard deviation. While most previous works only considered settings with two domains, i.e. source and target, the basic idea can be applied to any number of domains, as long as the domain membership of each sample point is known. Specifically, denoting as $q_x^d$ the distribution of activations for a given feature channel and domain $d$, an input $x^d \sim q_x^d$ to the DA-layer can be normalized according to

$$\mathrm{DA}(x^d; \mu_d, \sigma_d) = \frac{x^d - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}} \tag{2.1}$$

where $\mu_d = \mathbb{E}_{x \sim q_x^d}[x]$ and $\sigma_d^2 = \mathrm{Var}_{x \sim q_x^d}[x]$ are the mean and variance of the input distribution, respectively, and $\epsilon > 0$ is a small constant to avoid numerical issues. In practice, when the statistics $\mu_d$ and $\sigma_d^2$ are computed over the current mini-batch, we obtain the application of standard batch normalization separately to the sample points of each domain.
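As a concrete illustration, the per-domain normalization of Eq. (2.1) can be sketched in a few lines of NumPy. This is a minimal sketch for a single feature channel; the function name and interface are ours, not from the original works.

```python
import numpy as np

def da_layer(x, domains, eps=1e-5):
    """Domain Alignment layer (Eq. 2.1): normalize each activation with the
    mini-batch mean and variance of its own domain."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for d in np.unique(domains):
        mask = domains == d
        mu_d = x[mask].mean()     # mu_d, estimated on the mini-batch
        var_d = x[mask].var()     # sigma_d^2 (biased estimate, as in BN)
        out[mask] = (x[mask] - mu_d) / np.sqrt(var_d + eps)
    return out
```

Applied to a mini-batch containing samples from two domains, each group is normalized towards zero mean and unit variance independently, which is exactly the alignment effect described above.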

The main idea behind these works is to create a deep architecture with one parallel branch per domain, where all branches share the same parameters but embed different, domain-specific, BN layers (i.e. different statistics within DA-layers). The domain-specific BN layers align the distributions of features of different domains to the same reference distribution, achieving the desired domain adaptation effect. In the following sections, we will show how variants of DA-layers can be successfully applied in multiple distinct DA scenarios, even without the presence of target domain data during the initial training phase.



Figure 2.1. The idea behind the proposed framework for latent domain discovery. In this section, we introduce a novel deep architecture which, given a set of images, automatically discovers multiple latent domains and uses this information to align the distributions of the internal CNN feature representations of the source and target domains for the purpose of domain adaptation. In this way, more accurate target classifiers can be learned.


2.4 Latent Domain Discovery¹,²


As stated in Section 2.2, the problem of unsupervised DA has been widely studied: both theoretical results [14, 174] and several algorithms have been developed, considering shallow models [108, 84, 86, 153, 67] as well as deep architectures [154, 260, 77, 155, 80, 28, 24]. While deep neural networks tend to produce features that are more transferable and domain-invariant than those of shallow models, previous works have shown that the domain shift is only alleviated, but not entirely removed [59].

Most previous works on UDA focus on a single-source, single-target scenario. However, in many computer vision applications labeled training data are often generated from multiple distributions, i.e. there are multiple source domains. Examples of multi-source DA problems arise when the source set corresponds to images taken with different cameras, collected from the web, or associated to multiple points of view. In these cases, a naive application of single-source domain adaptation algorithms would not suffice, leading to poor results. Analogously, target samples may arise from more than a single distribution, and learning multiple target-specific models may significantly improve performance. Therefore, in the past, several research efforts have been devoted to developing domain adaptation methods that consider multiple source and target domains [174, 60, 252, 286]. However, these approaches assume that the multiple domains are known. A more challenging problem arises when training data correspond to latent domains, i.e. we can make a reasonable estimate of the number of source and target domains available, but we have no information, or only partial information, about domain labels. To address this problem, known in the literature as latent domain discovery, previous works have proposed methods which simultaneously discover hidden source domains and use them to learn the target classification models [104, 85, 283].


¹ M. Mancini, L. Porzi, S. Rota Bulò, B. Caputo, E. Ricci. Boosting Domain Adaptation by Discovering Latent Domains. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2018.

² M. Mancini, L. Porzi, S. Rota Bulò, B. Caputo, E. Ricci. Inferring Latent Domains for Unsupervised Deep Domain Adaptation. IEEE Transactions on Pattern Analysis & Machine Intelligence 2019.


This section introduces the first approaches [169, 168] based on deep neural networks that are able to automatically discover latent domains in the multi-source, multi-target UDA setting. Our method is inspired by the Domain Alignment Layers described in Section 2.3, introduced by [28, 29]. Our approach develops from the same intuition as Domain Alignment Layers, i.e. aligning the representations of the source and target distributions to a reference Gaussian. However, to address the additional challenges of discovering and handling multiple latent domains, we propose a novel architecture which is able to (i) learn a set of assignment variables which associate source and target samples to a latent domain and (ii) exploit this information to align the distributions of the internal CNN feature representations and learn robust target classifiers (Fig. 2.1). Our experimental evaluation shows that the proposed approach alleviates the domain discrepancy and outperforms previous UDA techniques on popular benchmarks, such as Office-31 [228], PACS [138] and Office-Caltech [86].

To summarize, the contributions presented in this section are threefold. Firstly, we introduce a novel deep learning approach for unsupervised domain adaptation which operates in a multi-source, multi-target setting. Secondly, we describe a novel architecture which is not only able to handle multiple domains, but also makes it possible to automatically discover them by grouping source and target samples. Thirdly, our experiments demonstrate that this framework is superior to many state-of-the-art single- and multi-source/target UDA methods.

2.4.1 Problem Formulation


We assume to have data belonging to one of several domains. Specifically, as in Section 2.1, we consider $k_s$ source domains, characterized by unknown probability distributions $p_{xy}^{s_1}, \ldots, p_{xy}^{s_{k_s}}$ defined over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space (e.g. images) and $\mathcal{Y}$ the output space (e.g. object or scene categories). Similarly, we assume $k_t$ target domains characterized by $p_{xy}^{t_1}, \ldots, p_{xy}^{t_{k_t}}$. Note that, for simplicity, we write $p(x, y \mid d)$ as $p_{xy}^d$. The numbers of source and target domains are not necessarily known a priori, and are left as hyperparameters of our method.

During training we are given a set of labeled sample points from the source domains and a set of unlabeled sample points from the target domains, while we may have partial or no information about the domain of the source sample points. We model the source data as a set $S = \{(x_1^s, y_1^s), \ldots, (x_n^s, y_n^s)\}$ of i.i.d. observations from a mixture distribution $p_{xy}^s = \sum_{i=1}^{k_s} \pi_{s_i} p_{xy}^{s_i}$, where $\pi_{s_i}$ is the unknown probability of sampling from source domain $s_i$. Similarly, the target sample $T = \{x_1^t, \ldots, x_m^t\}$ consists of i.i.d. observations from the marginal $p_x^t$ of the mixture distribution over the target domains. Furthermore, we denote by $x_S = \{x_1^s, \ldots, x_n^s\}$ and $y_S = \{y_1^s, \ldots, y_n^s\}$ the source data and label sets, respectively. We assume to know the domain label for a (possibly empty) sub-sample $\hat{S} \subseteq S$ from the source domains, and we denote by $d_{\hat{S}}$ the domain labels in $\mathcal{D}_s = \{s_1, \ldots, s_{k_s}\}$ of the sample points in $x_{\hat{S}}$. Note that, differently from the general formulation in Section 2.1, here neither $S$ nor $T$ might have domain labels available.



Figure 2.2. Schematic representation of our method applied to the AlexNet architecture (left) and of an mDA-layer (right).


Our goal is to learn a predictor that is able to classify data from the target domains. The major difficulties that this problem poses, and that we have to deal with, are: (i) the distributions of source and target domains can be drastically different, making it hard to apply a classifier learned on one domain to the others, (ii) we lack direct observation of target labels, and (iii) the assignment of each source and target sample point to its domain is unknown, or known for a very limited number of source sample points.

Several previous works [154, 260, 77, 80, 24, 28] have tackled the related problem of domain adaptation in the context of deep neural networks, dealing with (i) and (ii) in the single-domain case for both source and target data (i.e. $k_s = 1$ and $k_t = 1$). In particular, some recent works have demonstrated a simple yet effective approach based on the replacement of standard BN layers with specific Domain Alignment layers [29, 28]. These layers reduce the internal domain shift at different levels within the network by normalizing features in a domain-dependent way, matching their distributions to a pre-determined one. We revisit this idea in the context of multiple, unknown source and target domains and introduce a novel multi-domain DA layer (mDA-layer) in Section 2.4.2, which is able to normalize the multi-modal feature distributions encountered in our setting. To do this, our mDA-layers exploit a side-output branch attached to the main network (see Section 2.4.3), which predicts domain assignment probabilities for each input sample. Finally, in Section 2.4.4 we show how the predicted domain probabilities can be exploited, together with the unlabeled target samples, to construct a prior distribution over the network's parameters, which is then used to define the training objective for our network.

2.4.2 Multi-domain DA-layers


In Section 2.3, we described Domain Alignment Layers and how they are a simple yet effective solution for domain adaptation. However, applying them as described in Eq. (2.1) requires full domain knowledge, because for each domain $d$, $\mu_d$ and $\sigma_d^2$ need to be calculated on a data sample belonging to the specific domain $d$. In our case, however, we do not know the domain of the source/target sample points, or we have only partial knowledge about it. To tackle this issue, we propose to model the layer's input distribution as a mixture of Gaussians, with one component per domain.³ Specifically, we define a global input distribution $q_x = \sum_{d} \pi_d q_x^d$, where $\pi_d$ is the probability of sampling from domain $d$, and $q_x^d = \mathcal{N}(\mu_d, \sigma_d^2)$ is the domain-specific distribution for $d$, namely a normal distribution with mean $\mu_d$ and variance $\sigma_d^2$. Given a mini-batch $B = \{x_i\}_{i=1}^{b}$, a maximum likelihood estimate of the parameters $\mu_d$ and $\sigma_d^2$ is given by

$$\mu_d = \sum_{i=1}^{b} \alpha_{i,d}\, x_i, \qquad \sigma_d^2 = \sum_{i=1}^{b} \alpha_{i,d}\, (x_i - \mu_d)^2, \tag{2.2}$$

where

$$\alpha_{i,d} = \frac{q_{d \mid x}(d \mid x_i)}{\sum_{j=1}^{b} q_{d \mid x}(d \mid x_j)}, \tag{2.3}$$

and $q_{d \mid x}(d \mid x_i)$ is the conditional probability of $x_i$ belonging to domain $d$, given $x_i$. Clearly, the value of $q_{d \mid x}$ is known for all sample points for which we have domain information. In all other cases, the missing domain assignment probabilities are inferred from data, using the domain prediction network branch which will be detailed in Section 2.4.3. Thus, from the perspective of the alignment layer, these probabilities become an additional input, which we denote as $w_{i,d}$ for the predicted probability of $x_i$ belonging to $d$.

By substituting $w_{i,d}$ for $q_{d \mid x}(d \mid x_i)$ in (2.3), we obtain a new set of empirical estimates of the mixture parameters, which we denote as $\hat{\mu}_d$ and $\hat{\sigma}_d^2$. These parameters are used to normalize the layer's inputs according to

$$\mathrm{mDA}(x_i, w_i; \hat{\mu}, \hat{\sigma}) = \sum_{d \in \mathcal{D}} w_{i,d}\, \frac{x_i - \hat{\mu}_d}{\sqrt{\hat{\sigma}_d^2 + \epsilon}}, \tag{2.4}$$

where $w_i = \{w_{i,d}\}_{d \in \mathcal{D}}$, $\hat{\mu} = \{\hat{\mu}_d\}_{d \in \mathcal{D}}$, $\hat{\sigma} = \{\hat{\sigma}_d^2\}_{d \in \mathcal{D}}$, and $\mathcal{D}$ is the set of source/target latent domains. As in previous works [28, 29, 109], during back-propagation we calculate the derivatives through the statistics and weights, propagating the gradients to both the main input and the domain assignment probabilities.
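To make Eqs. (2.2)-(2.4) concrete, the following NumPy sketch implements the forward pass of an mDA-layer for a single feature channel. The interface is illustrative; in the actual architecture the statistics are computed per channel over convolutional activations.

```python
import numpy as np

def mda_layer(x, w, eps=1e-5):
    """multi-Domain Alignment layer, Eqs. (2.2)-(2.4).

    x : (b,)   activations of one feature channel over a mini-batch
    w : (b, k) soft assignment of each sample to the k latent domains;
               rows sum to 1 (one-hot for samples with known domain)
    """
    # Eq. (2.3), with w_{i,d} substituted for q_{d|x}(d|x_i)
    alpha = w / w.sum(axis=0, keepdims=True)
    # Eq. (2.2): weighted mean and variance per latent domain
    mu = alpha.T @ x                                        # \hat{mu}_d
    var = (alpha * (x[:, None] - mu[None, :]) ** 2).sum(0)  # \hat{sigma}_d^2
    # Eq. (2.4): normalize w.r.t. every domain, mix with the assignments
    x_norm = (x[:, None] - mu[None, :]) / np.sqrt(var[None, :] + eps)
    return (w * x_norm).sum(axis=1)
```

With one-hot assignments this reduces exactly to the per-domain batch normalization of Eq. (2.1); with soft assignments every sample contributes to the statistics of all latent domains in proportion to its predicted membership.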

2.4.3 Domain prediction


Our mDA-layers receive a set of domain assignment probabilities for each input sample point, which needs to be predicted, and the different mDA-layers in the network, despite having different input distributions, consistently share the same domain assignment for the sample points. As a practical example, in the typical case in which mDA-layers are used in a CNN to normalize convolutional activations, the network predicts a single set of domain assignment probabilities for each input image, which is then fed to all mDA-layers and broadcast across all spatial locations and feature channels corresponding to that image. We compute the domain assignment probabilities using a distinct section of the network, which we call the domain prediction branch, while we refer to the main section of the network as the classification branch. The two branches share the bottom-most layers and parameters, as depicted in Figure 2.2.


³ Interestingly, [51] showed how a similar strategy can be effective even within a single domain.


The domain prediction branch is implemented as a minimal set of layers followed by two softmax operations with $k_s$ and $k_t$ outputs for the source and target latent domains, respectively (more details follow in Section 2.4.5). The rationale for keeping the domain prediction separated between source and target derives from the knowledge that we have about the source/target membership of an input sample point, while the specific source or target domain it belongs to remains unknown. Furthermore, for each sample point $x_i$ with known domain membership $\hat{d}$, we fix in each mDA-layer $w_{i,d} = 1$ if $d = \hat{d}$, and $w_{i,d} = 0$ otherwise.
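One head of the branch can be sketched as follows. The linear-layer parametrization is an assumption of this sketch (the text only specifies "a minimal set of layers followed by a softmax"), and the function names are ours; the sketch also shows how a known domain label overrides the prediction with a one-hot assignment.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def domain_assignment(feat, W, known_domain=None):
    """One head of the domain-prediction branch: a (hypothetical) linear
    layer followed by a softmax over the k_s source (resp. k_t target)
    latent domains. If the sample's domain d^ is known, the prediction is
    replaced by the one-hot assignment of d^."""
    w = softmax(feat @ W)
    if known_domain is not None:
        w = np.zeros_like(w)
        w[known_domain] = 1.0
    return w
```

The same assignment vector is then broadcast to every mDA-layer in the network, so all layers normalize a given sample with a consistent domain membership.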

We split the network into a domain prediction branch and a classification branch at a low-level layer. This choice is motivated by the observation [6] that features tend to become increasingly more domain-invariant going deeper into the network, meaning that it becomes increasingly harder to compute a domain membership as a function of deeper features. In fact, as pointed out in [28], this phenomenon is even more evident in networks that include DA-layers.

2.4.4 Training the network


In order to exploit unlabeled data within our discriminative setting, we follow the approach sketched in [28], where unlabeled data are used to define a regularizer over the network's parameters. By doing so, we obtain a loss for $\theta$ that takes the following form:

$$L(\theta) = L_{\mathrm{cls}}(\theta) + L_{\mathrm{dom}}(\theta), \tag{2.5}$$

where $L_{\mathrm{cls}}$ is a loss term that penalizes errors in the final classification task, while $L_{\mathrm{dom}}$ accounts for the domain classification task.

Classification loss $L_{\mathrm{cls}}$. The classification loss consists of two components, accounting for the supervised sample from the source domains $S$ and the unlabeled target sample $T$, respectively:

$$L_{\mathrm{cls}}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f_C^{\theta}(y_i^s; x_i^s) + \frac{\lambda_C}{m} \sum_{i=1}^{m} H\!\left(f_C^{\theta}(\cdot\,; x_i^t)\right). \tag{2.6}$$

The first term on the right-hand side is the average log-loss on the supervised examples in $S$, where $f_C^{\theta}(y_i^s; x_i^s)$ denotes the output of the classification branch of the network for a source sample, i.e. the predicted probability of $x_i^s$ having class $y_i^s$. The second term on the right-hand side of (2.6) is the entropy $H$ of the classification distribution $f_C^{\theta}(\cdot\,; x_i^t)$, averaged over all unlabeled target examples $x_i^t$ in $T$ and scaled by a positive hyperparameter $\lambda_C$.
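Eq. (2.6) can be computed directly from the classifier's output probabilities; the sketch below is ours (function name, interface, and the $\lambda_C$ value are illustrative):

```python
import numpy as np

def classification_loss(p_src, y_src, p_tgt, lambda_c=0.1, eps=1e-12):
    """Eq. (2.6): average log-loss on labeled source samples plus the
    entropy of the class posterior on unlabeled target samples.

    p_src : (n, C) predicted class probabilities for source samples
    y_src : (n,)   source ground-truth labels
    p_tgt : (m, C) predicted class probabilities for target samples
    """
    n = len(y_src)
    # supervised term: negative log-probability of the true class
    log_loss = -np.log(p_src[np.arange(n), y_src] + eps).mean()
    # unsupervised term: average entropy of the target predictions
    entropy = -(p_tgt * np.log(p_tgt + eps)).sum(axis=1).mean()
    return log_loss + lambda_c * entropy
```

Minimizing the entropy term rewards confident predictions on the target sample, pushing the decision boundaries away from dense regions of the target distribution.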

Domain loss $L_{\mathrm{dom}}$. Akin to the classification loss, the domain loss has a component exploiting the supervision deriving from the known domain labels in $\hat{S}$ and a component exploiting the domain classification distribution on all sample points lacking supervision. In addition, however, the domain loss has a term that tries to balance the distribution of sample points across domains, in order to prevent the predictions from collapsing into trivial solutions such as constant assignment to a single domain. Accordingly, the loss takes the following form:
$$\begin{aligned} L_{\mathrm{dom}}(\theta) = {} & -\frac{\lambda_D}{|\hat{S}|} \sum_{x_i \in x_{\hat{S}}} \log f_{D_s}^{\theta}(d_i; x_i) \\ & - \lambda_B H\!\left(\bar{f}_{D_s}^{\theta}(\cdot)\right) + \frac{\lambda_E}{|S \setminus \hat{S}|} \sum_{x \in x_{S \setminus \hat{S}}} H\!\left(f_{D_s}^{\theta}(\cdot\,; x)\right) \\ & - \lambda_B H\!\left(\bar{f}_{D_t}^{\theta}(\cdot)\right) + \frac{\lambda_E}{m} \sum_{i=1}^{m} H\!\left(f_{D_t}^{\theta}(\cdot\,; x_i^t)\right). \end{aligned} \tag{2.7}$$

Here, $f_{D_s}^{\theta}$ and $f_{D_t}^{\theta}$ denote the outputs of the domain prediction branch for data points from the source and target domains, respectively, while $\bar{f}_{D_s}^{\theta}$ and $\bar{f}_{D_t}^{\theta}$ denote the distributions of the predicted domain classes across $S$ and $T$, respectively, i.e.

$$\bar{f}_{D_s}^{\theta}(y) = \frac{1}{n} \sum_{i=1}^{n} f_{D_s}^{\theta}(y; x_i^s), \qquad \bar{f}_{D_t}^{\theta}(y) = \frac{1}{m} \sum_{i=1}^{m} f_{D_t}^{\theta}(y; x_i^t).$$

The first term in (2.7) enforces the correct domain prediction on the sample points with known domain and it is scaled by a positive hyperparameter λD . The terms scaled by the positive hyperparameter λE enforce domain predictions with low uncertainty for the data points with unknown domain labels, by minimizing the entropy of the output distribution. Finally, the terms scaled by the positive hyperparameter λB enforce balanced distributions of predicted domain classes across the source and target sample, by maximizing the entropy of the averaged distribution of domain predictions. Interestingly, since the classification branch has a dependence on the domain prediction branch via the mDA-layers, by optimizing the proposed loss, the network learns to predict domain assignment probabilities that result in a low classification loss. In other words, the network is free to predict domain memberships that do not necessarily reflect the real ones, as long as this helps improving its classification performance.
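To make the structure of the loss concrete, the following is a minimal NumPy sketch of Eq. (2.7); it is illustrative rather than the thesis implementation. Here `p_src` and `p_tgt` stand for the softmax outputs of the domain prediction branch on a source and a target mini-batch, and the default hyperparameter values are placeholders only.

```python
import numpy as np

def entropy(p, eps=1e-8):
    # Shannon entropy of each row of a matrix of probabilities.
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p + eps)).sum(axis=-1)

def domain_loss(p_src, p_tgt, labeled_idx, domain_labels,
                lam_d=0.5, lam_e=0.1, lam_b=0.05):
    """Sketch of the domain loss in Eq. (2.7).

    p_src, p_tgt: (N, k) domain-branch softmax outputs for the source and
    target mini-batches; labeled_idx / domain_labels describe the
    (possibly empty) subset of source samples with known domain labels.
    """
    p_src = np.asarray(p_src, dtype=float)
    p_tgt = np.asarray(p_tgt, dtype=float)
    labeled_idx = list(labeled_idx)
    loss = 0.0
    # Supervised term on the labeled subset S-hat (negative log-likelihood).
    if labeled_idx:
        picked = p_src[labeled_idx, list(domain_labels)]
        loss -= lam_d * np.mean(np.log(picked + 1e-8))
    # Entropy-minimization terms: push unlabeled samples towards
    # confident domain assignments.
    unlabeled = [i for i in range(len(p_src)) if i not in set(labeled_idx)]
    if unlabeled:
        loss += lam_e * entropy(p_src[unlabeled]).mean()
    loss += lam_e * entropy(p_tgt).mean()
    # Balancing terms: maximize the entropy of the *average* assignment,
    # discouraging a collapse onto a single latent domain.
    loss -= lam_b * (entropy(p_src.mean(axis=0)) + entropy(p_tgt.mean(axis=0)))
    return float(loss)
```

With this formulation, a batch whose confident assignments are spread evenly across the latent domains scores lower than one collapsed onto a single domain, which is exactly the trivial solution the balancing term penalizes.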

We optimize the loss in (2.5) with stochastic gradient descent. Hence, the samples $S$, $T$, $\hat{S}$ considered in the computation of the gradients are restricted to the random subsets contained in the mini-batch. In Section 2.4.5 we provide more details on how each mini-batch is sampled. We call our model multi-Domain Alignment layers for latent domain discovery (mDA).
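As a rough, illustrative sketch of the mechanism behind the mDA-layers (a weighted version of batch normalization), the snippet below normalizes each sample with a mixture of per-latent-domain batch statistics, weighted by the soft assignments produced by the domain prediction branch. This is a simplified training-time picture under the assumption that every latent domain receives non-zero mass in the batch; it omits details of the actual Caffe implementation, such as running statistics and affine parameters.

```python
import numpy as np

def mda_normalize(x, w, eps=1e-5):
    """Mixture-weighted batch normalization (simplified sketch).

    x: (N, C) mini-batch of features; w: (N, k) soft assignments of each
    sample to k latent domains (the domain-branch output).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Per-domain statistics as weighted batch moments.
    wn = w / w.sum(axis=0, keepdims=True)   # weights normalized per domain
    mu = wn.T @ x                            # (k, C) weighted means
    var = wn.T @ (x ** 2) - mu ** 2          # (k, C) weighted variances
    # Normalize each sample with every domain's statistics, then mix the
    # k normalized versions according to the sample's soft assignment.
    xhat = (x[:, None, :] - mu[None]) / np.sqrt(var[None] + eps)  # (N, k, C)
    return (w[:, :, None] * xhat).sum(axis=1)
```

With $k = 1$ (a single latent domain and unit weights) this reduces to standard batch normalization, which is a useful sanity check.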

2.4.5 Experimental results


Datasets


In our evaluation we consider several common DA benchmarks: the combination of USPS [72], MNIST [131] and MNIST-m [77]; the Digits-five benchmark in [286]; Office-31 [228]; Office-Caltech [86] and PACS [133].
MNIST, MNIST-m and USPS are three standard datasets for digit recognition. USPS [72] is a dataset of digits scanned from U.S. envelopes, MNIST [131] is a popular benchmark for digit recognition and MNIST-m [77] is its counterpart, obtained by blending the original images with colored patches extracted from BSD500 photos [9]. Due to their different representations (e.g. colored vs. gray-scale), these datasets have been adopted as a DA benchmark in many previous works [77, 24, 22]. Here, we consider a multi-source DA setting, using MNIST and MNIST-m as sources and USPS as target, training on the union of the training sets and testing on the test set of USPS.

Digits-five is an experimental setting proposed in [286] which considers 5 digit-recognition datasets. In addition to MNIST, MNIST-m and USPS, it includes the SVHN [189] and Synthetic numbers [78] datasets. SVHN [189] contains pictures of real-world house numbers collected from Google Street View. Synthetic numbers [78] is built from computer-generated digits with multiple sources of variation (i.e. position, orientation, background, color and amount of blur), for a total of 500,000 images. We follow the experimental setting described in [286]: the train/test split comprises a subset of 25000 images for training and 9000 for testing for each domain, except for USPS, for which the entire dataset is used. As in [286], we report the results when either SVHN or MNIST-m is used as target and all the other domains are taken as sources.

Office-31 is a standard DA benchmark which contains images of 31 object categories collected from 3 different sources: Webcam (W), DSLR camera (D) and the Amazon website (A). Following [283], we perform our tests in the multi-source setting, where each domain is in turn considered as target, while the others are used as source.

Office-Caltech [86] is obtained by selecting the subset of 10 categories shared between the Office-31 and Caltech256 [93] datasets. It contains 2533 images, about half of which belong to Caltech256. The different domains are Amazon (A), DSLR (D), Webcam (W) and Caltech256 (C). In our experiments we consider the set of source/target combinations used in [85].

PACS [133] is a recently proposed DA benchmark which is especially interesting due to the significant domain shift between its domains. It contains images of 7 categories (dog, elephant, giraffe, guitar, horse, house, person) and 4 different visual styles: Photo (P), Art painting (A), Cartoon (C) and Sketch (S). We employ the dataset in two different settings. First, following the experimental protocol in [133], we train our model considering 3 domains as sources and the remaining one as target, using all the images of each domain. Differently from [133], we consider a DA setting (i.e. target data are available at training time) and we do not address the problem of domain generalization. Second, we use 2 domains as sources and the remaining 2 as targets, in a multi-source multi-target scenario. In this setting, the results are reported as the average accuracy over the 2 target domains.

In all experiments and settings, we assume to have no domain labels (i.e. $\hat{S} = \emptyset$), unless otherwise stated.

Networks and training protocols


We apply our approach to four different CNN architectures: the MNIST and SVHN networks described in [77, 78], AlexNet [124] and ResNet [98]. We choose AlexNet due to its widespread use in many relevant DA works [77, 28, 154, 155], while ResNet is taken as an exemplar of modern state-of-the-art architectures employing batch-normalization layers. Both AlexNet and ResNet are first pre-trained on ImageNet and then fine-tuned on the datasets of interest. The MNIST and SVHN architectures are chosen for fair comparison with previous works considering digits datasets [78, 286]. Unless otherwise noted, we optimize our networks using Stochastic Gradient Descent with momentum 0.9 and weight decay $5 \times 10^{-4}$.

For the evaluation on the MNIST, MNIST-m and USPS datasets, we employ the MNIST network described in [77], adding an mDA-layer after each convolutional and fully-connected layer. The domain prediction branch is attached to the output of conv1 and is composed of a convolution with the same meta-parameters as conv2, a global average pooling, a fully-connected layer with 100 output channels and, finally, a fully-connected classifier. Following the protocol described in [28, 77], we set the initial learning rate $l_0$ to 0.01 and anneal it through the schedule $l_p = \frac{l_0}{(1 + \gamma p)^{\beta}}$, where $\beta = 0.75$, $\gamma = 10$ and $p$ is the training progress, increasing linearly from 0 to 1. We rescale the input images to $32 \times 32$ pixels, subtract the per-pixel image mean of the dataset and feed the networks with random crops of size $28 \times 28$. A batch size of 128 images per domain is used.
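For reference, the annealing schedule above can be computed as follows, using the values quoted in this paragraph ($l_0 = 0.01$, $\gamma = 10$, $\beta = 0.75$); it is essentially Caffe's `inv` learning-rate policy with the iteration count replaced by the normalized training progress:

```python
def annealed_lr(p, l0=0.01, gamma=10.0, beta=0.75):
    """Learning rate l_p = l0 / (1 + gamma * p) ** beta, where p is the
    training progress increasing linearly from 0 to 1."""
    assert 0.0 <= p <= 1.0
    return l0 / (1.0 + gamma * p) ** beta
```

The schedule starts at $l_0 = 0.01$ and decays monotonically by roughly a factor of six over the course of training.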

For the Digits-five experiments we employ the SVHN architecture of [78], which is the same architecture adopted by [286], augmented with mDA-layers and a domain prediction branch in the same way as the MNIST network described in the previous paragraph. We train the architecture for 44000 iterations, with a batch size of 32 images per domain and an initial learning rate of $10^{-4}$, which is decayed by a factor of 10 after 80% of the training process. We use Adam as optimizer with weight decay $5 \times 10^{-5}$, and pre-process the input images as in the MNIST, MNIST-m and USPS experiments.

For the experiments on Office-31 and Office-Caltech we employ the AlexNet architecture. We follow a setup similar to the one proposed in [28, 29], fixing the parameters of all convolutional layers and inserting mDA-layers after each fully-connected layer, before the corresponding activation functions. The domain prediction branch is attached to the last pooling layer, pool5, and is composed of a global average pooling followed by a fully-connected classifier producing the final domain probabilities. The training schedule and hyperparameters are set following [28].

For the experiments on the PACS dataset we consider the ResNet architecture in the 18-layer setup described in [98], denoted as ResNet18. This architecture comprises an initial $7 \times 7$ convolution, denoted as conv1, followed by 4 main modules, denoted as conv2-conv5, each containing two residual blocks. To apply our approach, we replace each batch normalization layer in the residual blocks of the network with an mDA-layer. The domain prediction branch is attached to conv1, after the pooling operation. The branch is composed of a residual block with the same structure as conv2, followed by global average pooling and a fully-connected classifier. In the multi-target experiments we add a second, identical domain prediction branch to discriminate between target domains. We also add a standard BN layer after the final domain classifiers, which we found leads to a more stable training process in the multi-target case. In both cases, we adopt the same training meta-parameters as for AlexNet, with the exception of the weight decay, which is set to $10^{-6}$, and the learning rate, which is set to $5 \times 10^{-4}$. The network is trained for 600 iterations with a batch size of 48, equally divided between the domains, and the learning rate is scaled by a factor of 0.1 after 75% of the iterations.
Regarding the hyperparameters of our method, we set the number of source domains $k$ equal to $Q - 1$, where $Q$ is the number of different datasets used in each experiment. In the multi-source multi-target scenarios, since the domains are always equally split between source and target, we set $k = Q/2$ for both source and target. Following [28], in the experiments with AlexNet we fix $\lambda_C = \lambda_E = 0.2$ and $\lambda_B = 0.1$. Similarly, for the experiments on digits classification, we set $\lambda_C = \lambda_E = 0.1$ and $\lambda_B = 0.05$ for MNIST, MNIST-m and USPS, and $\lambda_C = 0.01$ and $\lambda_E = \lambda_B = 0.05$ for Digits-five, with $\lambda_E = 0.01$ if $\lambda_B = 0$, which we found leads to a more stable minimization of the domain-branch loss. In the experiments involving ResNet18 we select $\lambda_C = 0.1$ and $\lambda_E = \lambda_B = 0.0001$ through cross-validation, following the procedure adopted in [153, 28]. Similarly, in the multi-target ResNet18 experiments we select $\lambda_C = \lambda_E = \lambda_B = 0.1$. When domain labels are available for a subset of source samples, we fix $\lambda_D = 0.5$.

We implement all the models⁴ with the Caffe [111] framework and our evaluation is performed using an NVIDIA GeForce GTX 1070 GPU. We initialize both AlexNet and ResNet18 from models pre-trained on ImageNet, taking AlexNet from the Caffe model zoo and converting ResNet18 from the original Torch model⁵. For all networks and experiments, we add mDA-layers and their variants in place of the standard BN layers.

Results


In this section, we first analyze the proposed approach, demonstrating the advantages of considering multiple sources/targets and of discovering latent domains. We then compare the proposed method with state-of-the-art approaches. For all the experiments we report the results in terms of accuracy, repeating each experiment at least 5 times and averaging the results. In the multi-target experiments, the reported accuracy is the average of the accuracies over the target domains. As for standard deviations, since we do not tune the hyperparameters of our model and baselines using the accuracy on the target domain, their values can be high in some settings. For this reason, in order to provide a more appropriate analysis of the significance of our results, we adopt the following approach. Let us model the accuracy of an algorithm as a random variable $X_a$ with unknown distribution. The accuracy of a single run of the algorithm is an observation from this distribution. Therefore, in order to compare two algorithms we consider the two sets of associated observations $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$ and estimate the probability that one algorithm is better than the other as:

4 Code available at: https://github.com/mancinimassimiliano/latent_domains_DA.git

5 https://github.com/HolmesShuan/ResNet-18-Caffemodel-on-ImageNet


$$
p(X_a > X_b) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} \delta(a > b)
$$

where $\delta(\cdot)$ is the indicator function, equal to 1 if its argument is true and 0 otherwise. In the following we use this metric to compare our approach against a baseline where no latent domain discovery process is implemented (specifically, the method DIAL [29], see below), considering five runs for each experiment. For the sake of clarity, we denote this probability estimate as $p$.
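The estimate amounts to counting, over all pairs of runs, the fraction in which one algorithm's accuracy beats the other's. A minimal sketch (the function name is ours):

```python
import numpy as np

def prob_better(acc_a, acc_b):
    """Estimate p(X_a > X_b) from two sets of observed accuracies,
    i.e. the fraction of pairs (a, b) with a > b."""
    a = np.asarray(acc_a, dtype=float)
    b = np.asarray(acc_b, dtype=float)
    # Broadcast to compare every run of A against every run of B.
    return float((a[:, None] > b[None, :]).mean())
```

A value of 0.5 indicates that the two methods are indistinguishable under this metric, while values close to 1 indicate that A wins in almost every pairwise comparison.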

In the following we first analyze the performance of the proposed approach with $\lambda_B = 0$ (denoted as mDA $\lambda_B = 0$), i.e. the algorithm we presented in [169], and then we describe the impact of the loss term introduced in this section by setting $\lambda_B > 0$ (denoted simply as mDA).

Experiments on the Digits datasets


In a first series of experiments, reported in Table 2.1, we test the performance of our approach on the MNIST, MNIST-m to USPS benchmark (M-Mm to U). The comparison includes: (i) the baseline network trained on the union of all source domains (Unified sources); (ii) training separate networks for each source and selecting the one that performs best on the target (Best single source); (iii) DIAL [29], trained on the union of the sources (DIAL [29] - Unified sources); (iv) DIAL, trained separately on each source, selecting the best-performing model on the target (DIAL [29] - Best single source). We also report the results of our approach in the ideal case where the multiple source domains are known and we do not need to discover them (Multi-source DA). For our approach with $\lambda_B = 0$, we consider several different values of $k$, i.e. the number of discovered source domains.

By looking at the table, several observations can be made. First, there is a large performance gap between models trained only on source data and DA methods, confirming that deep architectures by themselves are not enough to solve the domain shift problem [59]. Second, in analogy with previous works on DA [174, 60, 252], we found that considering multiple sources is beneficial for reducing the domain shift with respect to learning a model on the unified source set. Finally, and more importantly, when the domain labels are not available, our approach is successful in discovering latent domains and in exploiting this information to improve accuracy on target data, partially filling the performance gap between the single-source models and Multi-source DA. Interestingly, the performance of our algorithm changes only slightly for different values of $k$, motivating our choice to always fix $k$ to the known number of domains in the following experiments. Importantly, comparing our approach with DIAL, we achieve higher accuracy in most of the runs, i.e. $p = 0.65$. In this experiment, the introduction of the loss term forcing a uniform assignment among clusters (denoted as mDA) leads to performances comparable to our method with $\lambda_B = 0$. This behaviour can be ascribed to the fact that the separation among the different domains is quite clear in this case, so adding constraints to the domain discovery process is not required. In the following, we show that the proposed loss is beneficial on more challenging datasets.

In a second set of experiments (Table 2.2), we compare our approach with previously and recently proposed single- and multi-source unsupervised DA approaches. Following [286], we perform experiments on the Digits-five dataset, considering two settings with SVHN and MNIST-m as targets. As in the previous case, we evaluate the performance of the baseline network (with and without BN layers) and of DIAL when trained on the union of the sources and, as an upper bound, our Multi-source DA with perfect domain knowledge. Moreover, we consider the Deep Cocktail Network (DCTN) [286] multi-source DA model, as well as the "source only" baseline and the single-source DA models reported in [286]: Reverse gradient (RevGrad) [77] and Domain Adaptation Networks (DAN) [154]. For all single-source DA models we consider two settings: "Unified Sources", where all source domains are merged, and "Multi-Source", where a separate model is trained for each source domain and the final prediction is computed as an ensemble. As we can see, the Unified Sources DIAL already achieves remarkable results in this setting, outperforming DCTN, and Multi-source DA only provides a modest performance increase. As expected, the performance of our approach lies between these two ($p$ equal to 0.56 and 0.64 for SVHN and MNIST-m respectively, with $\lambda_B = 0$).

Table 2.1. Digits datasets: comparison of different models in the multi-source scenario. MNIST (M) and MNIST-m (Mm) are taken as source domains, USPS (U) as target.

| Method | M-Mm to U |
|---|---|
| Unified sources | 57.1 |
| Best single source | 59.8 |
| DIAL [29] - Unified sources | 81.7 |
| DIAL [29] - Best single source | 81.9 |
| mDA ($\lambda_B = 0$, $k = 2$) | 82.5 |
| mDA ($\lambda_B = 0$, $k = 3$) | 82.2 |
| mDA ($\lambda_B = 0$, $k = 4$) | 82.7 |
| mDA ($\lambda_B = 0$, $k = 5$) | 82.4 |
| mDA ($k = 2$) | 82.4 |
| Multi-source DA | 84.2 |


Experiments on PACS


Comparison with state of the art. In our main PACS experiments we compare the proposed approach with the baseline ResNet18 network and with ResNet18 + DIAL [29], both trained on the union of the source sets. As in the digits experiments, we also report the performance of our method when perfect domain knowledge is available (Multi-source DA). Table 2.3 shows our results. In general, DA models are especially beneficial when considering the PACS dataset, and multi-source DA networks significantly outperform the single-source one. Remarkably, our model is able to infer domain information automatically, without supervision. In fact, its accuracy is either comparable with Multi-source DA (Photo, Art and Cartoon) or in between DIAL and Multi-source DA (Sketch). The average $p$ is 0.67. Looking at the partial results, it is interesting to note that the improvements of our approach and of Multi-source DA w.r.t. DIAL are more significant when either the Sketch or the Cartoon domain is employed as target set (average $p = 0.81$). Since these domains are less represented in the ImageNet database, we believe that the corresponding features derived from the pre-trained model are less discriminative, and DA methods based on multiple sources become more effective. Setting $\lambda_B > 0$ allows a further boost of performance in the Sketch scenario, where the source domains are closer in appearance. In the other settings, the domain shift is mostly between the Sketch domain and all the others, and it can be easily captured by our original formulation in [169].

Table 2.2. Digits-five [286] setting, comparison of different single source and multi-source DA models. The first row indicates the target domain with the others used as sources.

| | Method | SVHN | MNIST-m | Mean |
|---|---|---|---|---|
| Unified sources | Source only | 74.1 | 64.4 | 69.3 |
| | Source only + BN | 77.7 | 59.4 | 68.6 |
| | Source only from [286] | 72.2 | 64.1 | 68.2 |
| | RevGrad [77] | 68.9 | 71.6 | 70.3 |
| | DAN [154] | 71.0 | 66.6 | 68.8 |
| | DIAL [29] | 82.2 | 68.8 | 75.5 |
| | mDA ($\lambda_B = 0$) | 82.4 | 69.1 | 75.8 |
| | mDA | 82.6 | 70.1 | 76.4 |
| Multi-source | Source only [286] | 64.6 | 60.7 | 62.7 |
| | RevGrad [77] | 61.4 | 71.1 | 66.3 |
| | DAN [154] | 62.9 | 62.6 | 62.8 |
| | DCTN [286] | 77.5 | 70.9 | 74.2 |
| | Multi-source DA | 84.1 | 69.4 | 76.8 |

Table 2.3. PACS dataset: comparison of different methods using the ResNet architecture. The first row indicates the target domain, while all the others are considered as sources.

| Method | Sketch | Photo | Art | Cartoon | Mean |
|---|---|---|---|---|---|
| ResNet [98] | 60.1 | 92.9 | 74.7 | 72.4 | 75.0 |
| DIAL [29] | 66.8 | 97.0 | 87.3 | 85.5 | 84.2 |
| mDA ($\lambda_B = 0$) | 69.6 | 97.0 | 87.7 | 86.9 | 85.3 |
| mDA | 70.7 | 97.0 | 87.4 | 86.3 | 85.4 |
| Multi-source DA | 71.6 | 96.6 | 87.5 | 87.0 | 85.7 |


To analyze the performance of our approach in a multi-source multi-target scenario, we perform a second set of experiments on the PACS dataset, considering 2 domains as sources and the other 2 as targets. The results, shown in Table 2.4, comprise the same baselines as Table 2.3. Note that, apart from the difficulty of providing useful domain assignments in both the source and target sets during training, the domain prediction step is required even at test time, thus having a larger impact on the final performance of the model. The performance gap between DIAL and our approach increases in this setting compared to Table 2.3. Our hypothesis is that not accounting for multiple domains has a larger impact on the unlabeled target than on the labeled source. Looking at the partial results, when Photo is one of the target domains there are no particular differences in the final performance of the various DA models: this may be caused by the bias of the pre-trained network towards this domain. However, when the other domains are considered as targets, the performance gains produced by our model are remarkable. When Sketch is one of the target domains, our model completely fills the gap between the unified source/target DA method and the multi-source multi-target upper bound, with a gain of more than 7% when the other target is Art or Cartoon. Setting $\lambda_B > 0$ in this setting allows a further boost of performance. This is evident in the scenario where Photo and Art are both source or both target domains, with Cartoon and Sketch forming the other pair. In this scenario the source/target pairs are quite close in appearance, and enforcing a uniform assignment among the latent domains provides a better estimate of each of them.

Table 2.4. PACS dataset: comparison of different methods using the ResNet architecture on the multi-source multi-target setting. The first row indicates the two target domains.

| Method / Targets | Photo, Art | Photo, Cartoon | Photo, Sketch | Art, Cartoon | Art, Sketch | Cartoon, Sketch | Mean |
|---|---|---|---|---|---|---|---|
| ResNet [98] | 71.4 | 84.2 | 81.4 | 62.2 | 70.3 | 54.2 | 70.6 |
| DIAL [29] | 86.7 | 86.5 | 86.8 | 77.1 | 72.1 | 67.7 | 79.5 |
| Random assignment | 86.6 | 86.7 | 85.9 | 76.2 | 69.1 | 69.4 | 79.1 |
| mDA ($\lambda_E = \lambda_B = 0$) | 86.8 | 86.5 | 86.7 | 78.6 | 73.8 | 68.7 | 80.2 |
| mDA ($\lambda_B = \lambda_C = 0$) | 82.4 | 85.0 | 83.7 | 71.7 | 74.0 | 68.8 | 76.4 |
| mDA ($\lambda_B = 0$) | 86.1 | 87.9 | 87.9 | 79.3 | 79.9 | 74.9 | 82.6 |
| mDA | 87.2 | 88.1 | 88.7 | 77.7 | 81.3 | 77.0 | 83.3 |
| Multi-source/target DA | 87.7 | 88.9 | 86.8 | 79.0 | 79.8 | 75.6 | 83.0 |


Ablation study. We exploit the challenging multi-source multi-target scenario of Table 2.4 to assess the impact of the various components of our algorithm. In particular, we show how the performance is affected if (i) a random domain is assigned to each sample; (ii) no loss is applied to the domain prediction branch; (iii) no entropy loss is applied to the classification of unlabeled target samples. From Table 2.4 we can easily notice that if we drop either the domain prediction branch (random assignment) or the losses on top of it ($\lambda_E = \lambda_B = 0$), the performance of the model becomes comparable to that obtained by the DIAL baseline. This shows not only the importance of discovering latent domains, but also that both the domain branch and our losses allow meaningful subsets to be extracted from the data. Moreover, this demonstrates that our improvements are not only due to the introduction of multiple normalization layers, but also to the latent domain discovery procedure. As for the classification branch, without the entropy component on unlabeled target samples ($\lambda_C = 0$) the performance of the model decreases significantly (i.e. from 82.6 to 76.4 on average). This confirms the findings of previous works [28, 29] about the impact of this loss on normalization-based DA approaches. In particular, since source and target samples of different domains are independently normalized, the entropy loss generates a gradient flow through unlabeled samples in the direction of the most confident prediction. This is particularly important to learn useful features even for the target domain(s), for which no supervision is available.

In-depth analysis. The ability of our approach to discover latent domains is further investigated on PACS. First, in Figure 2.3, we show how our approach assigns source samples to different latent domains in the single-target setting. The four plots correspond to a single run of the experiments of Table 2.3. Interestingly, when either Cartoon (Figure 2.3c) or Sketch (Figure 2.3d) is the target, samples from Photo and Art tend to be associated to the same latent domain and, similarly, when either Photo (Figure 2.3a) or Art (Figure 2.3b) is the target, samples from Cartoon and Sketch are mostly grouped together. These results confirm the ability of our approach to automatically assign images of similar visual appearance to the same latent distribution. In Figure 2.4, we show the top-6 images associated to each latent domain for each sources/target setting. In most cases, images associated to the same latent domain have similar appearance, while there is high dissimilarity between images associated to different latent domains. Moreover, images assigned to the same latent domain tend to be associated with one of the original domains. For instance, the first row of Figure 2.4a contains only images from Art, while the third contains only images from Sketch. Note that no explicit domain supervision is ever given to our method in this setting.
深入分析。我们进一步在PACS数据集上研究了我们的方法发现潜在域的能力。首先,在图2.3中,我们展示了在单目标设置下,我们的方法如何将源样本分配到不同的潜在域。这四个图对应于表2.3中实验的一次运行结果。有趣的是,当目标域为卡通(图2.3c)或素描(图2.3d)时,照片和艺术风格的样本往往会被分配到同一个潜在域;同样,当目标域为照片(图2.3a)或艺术(图2.3b)时,卡通和素描的样本大多会被归为一类。这些结果证实了我们的方法能够自动将视觉外观相似的图像分配到相同的潜在分布中。在图2.4中,我们展示了在每个源/目标设置下,与每个潜在域相关联的前6张图像。在大多数情况下,与同一潜在域相关联的图像外观相似,而与不同潜在域相关联的图像之间则存在很大差异。此外,分配到同一潜在域的图像往往与原始域之一相关联。例如,图2.4a的第一行只包含艺术风格的图像,而第三行只包含素描风格的图像。请注意,在这种设置下,我们的方法从未得到过明确的域监督。



Figure 2.3. Distribution of the assignments produced by the domain prediction branch for each latent domain in all possible settings of the PACS dataset. Different colors denote different source domains.


In Figure 2.5, we show the histograms of the domain assignment probabilities predicted by our model with λB=0 in the various multi-source, multi-target settings of Table 2.3. As the figure shows, in most cases the various pairs of target domains tend to be very well separated: this justifies the large performance gains produced by our model in this scenario. The only cases where the separation is less marked are when Art and Photo, which have very similar visual appearance, are considered



Figure 2.4. Top-6 images associated to each latent domain for the different sources/target combinations. Each row corresponds to a different latent domain.


as targets. On the other hand, source domains are not always as clearly separated as the targets. In particular, the pairs Photo-Cartoon, Art-Photo and Art-Cartoon tend to receive similar assignments when they are considered as sources. A possible explanation is that the supervised source loss has a stronger influence on the domain assignment than the unsupervised target one. In any case, note that these results do not detract from the validity of our approach. In fact, our main objective is to obtain a good classification model for the target set, independently of the actual domain assignments we learn.

In Figure 2.6, the same analysis is performed on our method with the additional constraint of having a uniform assignment distribution among domains. As the figure shows, this constraint allows us to obtain a clearer domain separation in most cases, overcoming the difficulties that the domain prediction branch experienced in separating domain pairs such as Photo-Cartoon and Photo-Art.

We perform a similar analysis on another dataset, Digits-five. The results are reported in Figure 2.7. As the figure shows, when SVHN is the target domain, one of the latent domains (latent domain 1) receives very confident assignments for the samples of the MNIST dataset. The samples of the other source datasets receive assignments spread across all the latent domains, with the exceptions of USPS, which receives its most confident predictions for the second latent domain, and MNIST-m, which partially influences the first latent domain, the one with confident assignments to MNIST. One latent domain (latent domain 3) does not receive assignments from any of the sources: this might happen if the entropy term overcomes the uniform assignment constraint in the early stages of training. Similarly, when MNIST-m is the target domain, the first two latent domains receive confident assignments for samples belonging to the MNIST and SVHN datasets respectively, while the third and the fourth receive higher assignments for samples of the remaining source domains.



Figure 2.5. Distribution of the assignments produced by the domain prediction branch in all possible multi-target settings of the PACS dataset. Different colors denote different source domains (red: Art, yellow: Cartoon, blue: Photo, green: Sketch).



Figure 2.6. Distribution of the assignments produced by the domain prediction branch trained with the additional constraint on the entropy loss in all possible multi-target settings of the PACS dataset. Different colors denote different source domains (red: Art, yellow: Cartoon, blue: Photo, green: Sketch).



Figure 2.7. Distribution of the assignments produced by the domain prediction branch for each latent domain in all target settings of the Digits-five dataset. Different colors denote different source domains (black: MNIST, blue: MNIST-m, green: USPS, red: SVHN, yellow: Synthetic numbers).


Experiments on Office-31


In our Office-31 experiments we consider the following baselines, trained on the union of the source sets: (i) a plain AlexNet network; (ii) AlexNet with BN inserted after each fully-connected layer; and (iii) AlexNet + DIAL [29]. Additionally, we consider single source domain adaptation approaches, using the results reported in [286]. The methods are Transfer Component Analysis (TCA) [198], Geodesic Flow Kernel (GFK) [86], Deep Domain Confusion (DDC) [260], Deep Reconstruction Classification Networks (DRCN) [80] and Residual Transfer Network (RTN) [155], as well as the Reversed Gradient (RevGrad) [77] and Domain Adaptation Network (DAN) [154] algorithms considered in the digits experiments. For these algorithms we report the performances obtained in the "Best single source" and "Unified sources" settings, as available from [286]. As in the previous experiments, Multi-source DA with perfect domain knowledge can be regarded as a performance upper bound for our method. Finally, we include results reported in [286] for different multi-source DA models: Deep Cocktail Network (DCTN) [286], the two shallow methods in [282] (sFRAME) and [91] (SGF), and an ensemble of baseline networks trained on each source domain separately (Source only). These results are summarized in Table 2.5.

Table 2.5. Office-31 dataset: comparison of different methods using AlexNet. In the first row we indicate the source (top) and the target domains (bottom).

Setting | Method | A,W → D | A,D → W | W,D → A | Mean
Best single source [286] | TCA [198] | 95.2 | 93.2 | 51.6 | 68.8
Best single source [286] | GFK [86] | 95.0 | 95.6 | 52.4 | 68.7
Best single source [286] | DDC [260] | 98.5 | 95.0 | 52.2 | 70.7
Best single source [286] | DRCN [80] | 99.0 | 96.4 | 56.0 | 73.6
Best single source [286] | RevGrad [77] | 99.2 | 96.4 | 53.4 | 74.3
Best single source [286] | DAN [154] | 99.0 | 96.0 | 54.0 | 72.9
Best single source [286] | RTN [155] | 99.6 | 96.8 | 51.0 | 73.7
Unified sources | Source only from [286] | 98.1 | 93.2 | 50.2 | 80.5
Unified sources | Source only | 94.6 | 89.1 | 49.1 | 77.6
Unified sources | Source only + BN | 91.9 | 92.7 | 46.5 | 77.0
Unified sources | RevGrad [286] | 98.8 | 96.2 | 54.6 | 83.2
Unified sources | DAN [286] | 98.8 | 95.2 | 53.4 | 82.5
Unified sources | Single BN | 92.9 | 95.2 | 60.1 | 82.7
Unified sources | DIAL [29] | 93.8 | 94.3 | 62.5 | 83.5
Unified sources | mDA (λB=0) | 93.7 | 94.6 | 62.6 | 83.6
Unified sources | mDA | 93.6 | 93.6 | 62.4 | 83.2
Multi-source | Source only [286] | 98.2 | 92.7 | 51.6 | 80.8
Multi-source | sFRAME [282] | 54.5 | 52.2 | 32.1 | 46.3
Multi-source | SGF [91] | 39.0 | 52.0 | 28.0 | 39.7
Multi-source | DCTN [286] | 99.6 | 96.9 | 54.9 | 83.8
Multi-source | Multi-source DA | 94.8 | 95.8 | 62.9 | 84.5


We note that, in this dataset, the improvements obtained by adopting a multi-source model instead of a single-source one are small. This is in accordance with the findings in [133], where it is shown that the domain shift in Office-31, when considering deep features, is indeed quite limited compared to PACS, and is mostly linked to changes in the background (Webcam-Amazon, DSLR-Amazon) or in the acquisition camera (DSLR-Webcam). This is further supported by the smaller gap between DIAL and our method in this case compared to the previous experiments (average p of 0.54). In this setting, introducing our uniform loss term does not provide a boost in performance. We ascribe this behaviour to the fact that, in this scenario, each batch is built with a non-uniform number of samples per domain (following [28]), while our current objective assumes balanced sampling among domains.

In a final Office-31 experiment, we consider a setting where the true domain of a subset of the source samples is known at training time. Figure 2.8 shows the average accuracy obtained when different amounts of domain labels are available. Interestingly, by increasing the level of domain supervision the accuracy quickly saturates towards the value of Multi-source DA, completely filling the gap with as few as 5% of the source samples.



Figure 2.8. Office31 dataset. Performance at varying number of domain labels (%) for source samples.


Comparison with S.o.t.A. on inferring latent domains


In this section we compare the performance of our approach with previous DA works which also consider the problem of inferring latent domains [104, 283, 85]. Since there are no previous works adopting deep learning models (i) in a multi-source setting while (ii) discovering hidden domains, the methods we compare to all employ handcrafted features. For these approaches we report results taken from the original papers. Furthermore, we evaluate the method of Gong et al. [85] using features from the last layer of the AlexNet architecture. For a fair comparison, when applying our method we freeze AlexNet up to fc7, and apply mDA layers only after fc7 and the classifier.

We first consider the Office-31 dataset, as this benchmark has been used in [104, 283], showing the results in Table 2.6. Our model outperforms all the baselines, with a clear margin in terms of accuracy. Importantly, even when the method in [85] is applied to features derived from AlexNet, our approach still leads to higher accuracy. For the sake of completeness, in the same table we also report results from previous multi-source DA methods [92, 190, 148] based on shallow models. While these approaches significantly outperform [104] and [283], their accuracy is still much lower than ours. Moreover, introducing our novel loss term provides higher performance with respect to our approach with λB=0.

To provide a comparison in a multi-target scenario, we also consider the Office-Caltech dataset, comparing our model with [104, 85]. Following [85], we test both single-target (Amazon) and multi-target (Amazon-Caltech and Webcam-DSLR) scenarios. As in the PACS multi-source/multi-target case, the assignment of each sample to the source or target set is assumed to be known, while the assignment to the specific domain is unknown. We again want to remark that, since we do not assume to know the target domain to which a sample belongs, the task is even harder, as we require a domain prediction step also at test time. As in the Office-31 experiments, our approach outperforms all baselines, including the method in [85] applied to AlexNet features. In this scenario, introducing our uniform loss provides a boost in performance in the multi-target setting, where the two source/target pairs have similar appearance. This is in line with what was reported for the multi-target experiments on PACS (Table 2.4).

Table 2.6. Office-31: comparison with state-of-the-art algorithms. In the first row we indicate the source (top) and the target domains (bottom).

Method | A,D → W | A,W → D | W,D → A | Mean
Hoffman et al. [104] | 24.8 | 42.7 | 12.8 | 26.8
Xiong et al. [283] | 29.3 | 43.6 | 13.3 | 28.7
Gong et al. (AlexNet) [85] | 91.8 | 94.6 | 48.9 | 78.4
mDA (λB=0) | 93.1 | 94.3 | 64.2 | 83.9
mDA | 94.5 | 94.9 | 64.9 | 84.8
Gopalan et al. [92] | 51.3 | 36.1 | 35.8 | 41.1
Nguyen et al. [190] | 64.5 | 68.6 | 41.8 | 58.3
Lin et al. [148] | 73.2 | 81.3 | 41.1 | 65.2


Table 2.7. Office-Caltech dataset: comparison with state-of-the-art algorithms. In the first row we indicate the source (top) and the target domains (bottom).


Method | A,C → W,D | W,D → A,C | C,W,D → A | Mean
Gong et al. [85] - original | 41.7 | 35.8 | 41.0 | 39.5
Hoffman et al. [104] - ensemble | 31.7 | 34.4 | 38.9 | 35.0
Hoffman et al. [104] - matching | 39.6 | 34.0 | 34.6 | 36.1
Gong et al. [85] - ensemble | 38.7 | 35.8 | 42.8 | 39.1
Gong et al. [85] - matching | 42.6 | 35.5 | 44.6 | 40.9
Gong et al. (AlexNet) [85] - ensemble | 87.8 | 87.9 | 93.6 | 89.8
mDA (λB=0) | 93.5 | 88.2 | 93.7 | 91.8
mDA | 95.0 | 88.7 | 93.9 | 92.5


2.4.6 Conclusions


In this section, we presented a novel deep DA model for automatically discovering latent domains within visual datasets. The proposed deep architecture is based on a side-branch that computes the assignment of source and target samples to their associated latent domain. These assignments are then used within the main network by novel domain alignment layers, which reduce the domain shift by aligning the feature distributions of the discovered sources and the target domains. Our experimental results demonstrate the ability of our model to efficiently exploit the discovered latent domains for addressing challenging domain adaptation tasks. Future work could investigate other architectural design choices for the domain prediction branch, as well as the possibility of integrating it into other CNN models for unsupervised domain adaptation [77]. In the next section, we will remove the assumption of having target data during training, focusing on the domain generalization scenario. We will show how mDA layers can be extended to effectively address the domain generalization problem.

2.5 Domain Generalization


In the previous section, we showed how it is possible to effectively overcome the domain shift problem even when our source/target domain is a mixture of multiple ones. However, that approach relies on a fundamental assumption: the availability of target data during training. Unfortunately, this assumption is not always satisfied in practice.

Let us consider the problem of semantic place categorization from visual data [273]. This task is important in robotics, since correctly identifying the semantic category of a place allows the robot to improve its localization, mapping and exploration capabilities [249, 120]. We have three strategies to address this problem. The first is using labeled datasets of training images [274, 65, 262, 173]. While the resulting models are very accurate when test samples are similar to training data, their performance significantly degrades when the robot collects images with very different visual appearance [210].

A second strategy is exploiting domain adaptation (DA) techniques [208, 117, 46]. These methods develop models which are meant to be effective in the scenario where the robot will operate, i.e. the target domain. While domain adaptation algorithms provide effective solutions, they require some prior knowledge of the target domain at training time, e.g. access to target data. Unfortunately, this information may not always be available. Consider for instance a household robot: since the number of possible customers is huge, it is inconceivable to collect data for each possible house and application scenario.

In this context, a more relevant problem to address is domain generalization (DG). As described in previous sections, opposite to DA, where target data are exploited to produce a classifier accurate under specific working conditions, the idea behind DG is to learn a domain-agnostic model applicable to any unseen target domain. In other words, the goal of DG is building a model which is as general as possible, e.g. employable by different robots and in various environmental conditions.

In this section, we build on the mDA layers presented in Section 2.4 and first propose a novel deep learning framework for DG, which we call WBN (Weighted Batch Normalization for Domain Generalization) [164]. The approach develops from the idea that, given data from multiple source domains and the associated models, the best model for the target domain can be generated on-the-fly when a novel sample arrives, by optimally combining the precomputed models from the source domains (see Fig. 2.9). To implement this idea, we design a novel CNN architecture which relies on two main components. First, inspired by recent works on domain adaptation [29, 142], we construct multiple source models by embedding into a common CNN a few domain-specific Batch Normalization layers. In this way, different classifiers can be built while keeping the number of parameters limited. Second, we design a lateral network branch which computes the likelihood that a certain instance belongs to a given domain. When applied to a novel target sample, this branch calculates its probabilities of being part of the different source domains. These values are used to construct the target classifier as a combination of the known source models. This is similar to the idea of mDA layers, with the difference that (i) no target data are available during training and (ii) the domain assignment branch is used to compute the similarity of target samples with the source domains.

In the second part of this section, we extend this approach by considering domain-specific classifiers, and classifying each incoming target image by optimally fusing the prediction scores of the source-specific classifiers. As in WBN, this is achieved through an end-to-end trainable deep architecture with two main components. The first implements the source-specific classifiers, while the second is a network branch which computes the similarities of an input sample to all source domains, so as to assign weights to the source classifiers and properly merge their predictions. The second module is also designed to easily permit, if needed, the integration of a domain-agnostic classifier which, acting in synergy with the domain-specific models, can further improve generalization. We call this approach Best Sources Forward for Domain Generalization (BSF) [163].
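The fusion step above can be sketched in a few lines of NumPy (a minimal illustration with hypothetical names and shapes; the actual BSF model computes both the scores and the weights with an end-to-end deep network):

```python
import numpy as np

def fuse_predictions(source_scores, weights):
    """Merge source-specific classifier outputs with per-sample
    domain-similarity weights.

    source_scores: (k_s, n, c) class scores of the k_s source classifiers.
    weights:       (n, k_s) similarity of each sample to each source
                   domain, rows summing to 1 (lateral-branch output).
    Returns the (n, c) fused scores used to classify target images.
    """
    return np.einsum('nk,knc->nc', weights, source_scores)
```

When the weights of a sample are one-hot, the fused prediction coincides with the corresponding source classifier; intermediate weights interpolate between the source models.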



Figure 2.9. The domain generalization problem. At training time (orange block) images of multiple source domains (e.g. A,B,C) are available. These images are used to train different models with parameters θi . Our approach automatically computes a model D which accurately classifies images of a novel domain (not available during training) by combining the models of the known domains.


To this aim, the novel Weighted Batch Normalization (WBN) layers are introduced. We demonstrate the effectiveness of the proposed DG approach with extensive experiments on three datasets, namely the COsy Localization Database (COLD) [209], the Visual Place Categorization (VPC) dataset [273] and the Specific PlacEs Dataset (SPED) [42]. Moreover, we show how the proposed framework can be employed when no prior information about source domains is available at training time: given a training set, our model can be used to automatically cluster training data and learn multiple models, discovering latent domains and associated classifiers.

To summarize, the main contributions of this section are: (i) an extension of the mDA framework, WBN, which exploits the similarity of target samples with the given source domains to address the DG problem; (ii) the introduction of the problem of domain generalization for semantic place recognition, showing how WBN is effective in addressing it, even without exact domain knowledge; (iii) an extension of WBN which considers source-specific classifiers in place of domain-specific alignment layers, showing its effectiveness on standard DG benchmarks in computer vision.

2.5.1 Problem Formulation


The goal of DG is to extend the knowledge acquired from a set of source domains to any unknown target domain. In this context, the source sets correspond, e.g., to data acquired by multiple robots in different environments, while the unknown target corresponds to any unseen environment. Formally, following the notation in Section 2.1, we have our training set defined as S = {(x_i^s, y_i^s, s_i)}_{i=1}^n, where x_i^s ∈ X, y_i^s ∈ Y and s_i ∈ D^s, with D^s ⊂ D. Note that no target domain data T is available during training. Moreover, we assume |D^s| = k_s > 1, analyzing in Sections 2.6 and 2.7 the case where |D^s| = 1 but other information is available. Our goal is to learn a predictor f : X → Y able to work in any possible target domain D^t unseen during training, i.e. D^t ∉ D^s. It is worth highlighting that, differently from the latent domain discovery problem presented in Section 2.4, here we might have full knowledge of the domain labels.

2.5.2 Starting point: Domain Generalization with Weighted BN 6


A clear issue with DA methods, including DA-layers and mDA layers, is that they require the presence of a target set X^t in the training phase. This implies that data collected in the scenario of interest must be available for learning the classification model. However, a more realistic situation, especially in robotics, is one where we employ our system in completely unseen environments/domains. As an example, consider a service robot: it is unfeasible to collect data for all possible working environments. Therefore, it is important to drop the assumption of having target data beforehand when designing deep models addressing the domain shift problem. In this subsection, we start by removing the need for target data in DA-layers and mDA layers.

From the formulation of DA-layers defined in Eq. (2.1), we can obtain multiple domain-specific models by considering separate BN statistics for each of the source domains during training. In particular, given the features of a sample x_i at a given layer and spatial location (omitted for simplicity), as well as its domain label s_i, we can apply domain-specific BN as follows:

(2.8) \hat{x}_i = \gamma \, \frac{x_i - \mu_{s_i}}{\sqrt{\sigma_{s_i}^2 + \epsilon}} + \beta.

The problem with this formulation is that, at test time, no statistics from the unseen target domains are available. To solve this problem, we resort to a soft version of Eq. (2.8). Let us rewrite Eq. (2.8) as:

(2.9) \hat{x}_i = \gamma \sum_{j=1}^{k_s} \mathbb{1}_{s_i = s_j} \, \frac{x_i - \mu_{s_j}}{\sqrt{\sigma_{s_j}^2 + \epsilon}} + \beta.


6 M. Mancini, S. Rota Bulò, B. Caputo, E. Ricci. Robust Place Categorization with Deep Domain Generalization. IEEE Robotics and Automation Letters, July 2018, vol. 3, n. 3, pp. 2093-2100.


The hard assignment in Eq. (2.9), used at training time, can be replaced with a weighted version at test time, modeling the uncertainty we have about our target domain. In particular, we can write:

(2.10) \hat{x}_i^t = \gamma \sum_{j=1}^{k_s} w_{i,j} \, \frac{x_i - \mu_{s_j}}{\sqrt{\sigma_{s_j}^2 + \epsilon}} + \beta,




Figure 2.10. Example of the proposed WBN framework. (a) AlexNet with BN layers after each fully connected layer. (b) The same network employing Domain Alignment layers for domain adaptation, where different BN statistics are used for the source and target domains. (c) Our approach for DG with WBN layers.


where w_{i,j} is the probability of sample i belonging to domain j, with \sum_{j=1}^{k_s} w_{i,j} = 1 and w_{i,j} ≥ 0 for all j. The intuition behind this choice is to derive a classification model for the target domain as a combination of models from the source domains, with the weights derived from the similarity of the target domain data to the source domains.
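The weighted normalization of Eq. (2.10) can be sketched in NumPy as follows (feature tensors are simplified to 2-D arrays and all names are illustrative, not the thesis implementation):

```python
import numpy as np

def weighted_bn(x, mu, sigma2, w, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x with the statistics of every source domain and
    mix the results with the domain-assignment weights w.

    x:      (n, d) features of n samples.
    mu:     (k_s, d) per-domain means.
    sigma2: (k_s, d) per-domain variances.
    w:      (n, k_s) domain probabilities, rows summing to 1.
    """
    # (n, k_s, d): x normalized with each domain's statistics
    normed = (x[:, None, :] - mu[None]) / np.sqrt(sigma2[None] + eps)
    # weighted combination over the k_s domains, then affine transform
    return gamma * np.einsum('nk,nkd->nd', w, normed) + beta
```

With a one-hot w the expression reduces to the domain-specific BN of Eq. (2.8); at test time the soft weights produced by the lateral branch interpolate between the source statistics.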

In order to compute the weights w_{i,j}, we resort to the same domain classification module described in Section 2.4.3, employing a separate network branch which originates from the first few convolutional layers of the main network (see Fig. 2.10c). This choice allows end-to-end training while keeping the number of parameters limited. The specific architecture of the branch may vary, with the only restriction that its final output must be a probability vector of dimension k_s, corresponding to the number of known domains.
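Since the only structural requirement on the branch is that it ends in a probability vector over the k_s known domains, a toy stand-in could be a linear head followed by a softmax (W and b are hypothetical parameters, not taken from the thesis):

```python
import numpy as np

def domain_branch(features, W, b):
    """Toy domain-prediction head: a linear layer followed by a
    softmax, so each row of the output is a probability vector
    over the k_s known source domains."""
    logits = features @ W + b
    z = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return z / z.sum(axis=1, keepdims=True)
```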
Denoting the classification and domain prediction branches as f_C^θ and f_D^θ respectively, during training we minimize a simplified version of Eq. (2.5), namely:

(2.11) L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \log f_C^{\theta}(y_i^s; x_i^s) + \lambda \log f_D^{\theta}(d_i; x_i) \right]

the loss is the sum of two terms, one considering place label information for accurate recognition, the other enforcing the lateral branch to successfully compute the correct domain,with λ balances the contribution of the semantic classification and the domain prediction terms. At test time,the domain assignment produced by fDθ for target samples will be used to obtain the domain similarity w,j of Eq. (2.10).
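Concretely, Eq. (2.11) is just a sum of two negative log-likelihood terms. A minimal NumPy sketch under the assumption that the two branches output softmax probabilities (function and variable names are our own, not the thesis code):

```python
import numpy as np

def joint_loss(class_probs, domain_probs, y, d, lam=1.0):
    """Eq. (2.11) sketch: NLL of the semantic label y plus lam times
    the NLL of the domain label d, averaged over the n samples."""
    n = len(y)
    log_pc = np.log(class_probs[np.arange(n), y])   # log f_C(y_i; x_i)
    log_pd = np.log(domain_probs[np.arange(n), d])  # log f_D(d_i; x_i)
    return -np.mean(log_pc + lam * log_pd)

# demo: perfectly confident, correct predictions give zero loss
class_probs = np.array([[1.0, 0.0], [0.0, 1.0]])
domain_probs = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = joint_loss(class_probs, domain_probs, y=[0, 1], d=[0, 0])
```

Setting `lam=0` recovers a plain classification loss, i.e. a domain-agnostic baseline.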

Finally, we would like to highlight that this framework can easily be extended to perform DG in the absence of domain labels, following the procedure described in Section 2.4. In particular, we can rely on the soft-assignment strategy to compute the latent domain statistics, as in Eq. (2.2). As in the previous section, the intuition is that, since similar input images tend to produce similar outputs in the lateral network branch, the visual data will be implicitly and automatically clustered, enabling a latent domain discovery process. In this scenario, we let the domain assignment network be guided by the semantic loss while computing the statistics using Eq. (2.2). In Fig. 2.10 we show the difference between this model and standard DA-layers.

2.5.3 WBN Experiments: Domain Generalization in Semantic Place Categorization


Datasets. In our experiments we use three robot vision datasets, namely the widely adopted COLD [209] and VPC [273] datasets, and the recent SPED dataset [42].

The COLD database contains three datasets of indoor scenes acquired in different laboratories and with different robots. COLD-Freiburg (Fr) has 26 image sequences collected in the Autonomous Intelligent Systems Laboratory at the University of Freiburg, with a camera mounted on an ActivMedia Pioneer-3 robot. COLD-Ljubljana (Lj) contains 18 sequences acquired from an iRobot ATRV-Mini platform at the Visual Cognitive Systems Laboratory of the University of Ljubljana. For COLD-Saarbrücken (Sa), an ActivMedia PeopleBot was employed to gather 29 sequences inside the Language Technology Laboratory at the German Research Center for Artificial Intelligence in Saarbrücken.

The VPC dataset contains images acquired from several rooms of 6 different houses with multiple floors. The images are acquired by means of a camcorder placed on a rolling tripod, simulating a mobile robotic platform. The dataset contains 11 semantic categories, but only 5 are common to all houses: bedroom, bathroom, kitchen, living room and dining room. Following previous works [273, 65, 290], we use the common categories in our experiments.
SPED is a large-scale dataset introduced in the context of place recognition. It contains images from 2543 outdoor cameras, collected from the Archive of Many Outdoor Scenes (AMOS) [110] during February and August 2014.⁷

Networks and training protocols. For COLD and VPC we perform experiments with two common architectures: AlexNet [124] and ResNet [98]. For AlexNet we use the standard architecture pre-trained on ImageNet [52]. In all the experiments, we fine-tune the last two fully-connected layers, rescaling the input images to 227×227 pixels. For ResNet we consider the 10-layer version of the architecture, again pre-trained on ImageNet. In all the experiments, we rescale the input images to 224×224 pixels, fine-tuning the network starting from the last residual block. Both networks are trained with a weight decay of 0.0005 and an initial learning rate of 0.001, while the initial learning rate of the final classifier is set to 0.01. The learning rate is dropped by a factor of 0.1 after 90% of the iterations. For the experiments on COLD, we use a batch size of 256 for AlexNet and 64 for ResNet, training the networks for 1000 iterations. For VPC, we set the batch size to 128 and 64 for AlexNet and ResNet respectively, training the networks for 2000 iterations. The training parameters are the same for our method and the baselines, and fine-tuning is performed for all the models.

WBN can be applied to common CNNs by simply replacing standard BN layers with our WBN layers. While ResNet already employs BN layers, this is not true for AlexNet. For these experiments we therefore employ a variant of AlexNet where BN layers are inserted after each fully-connected layer.

For SPED we use AlexNet and the AMOSNet architecture, following [42]. AMOSNet is very similar to AlexNet, with the first fully-connected layer replaced by a convolutional layer and a pooling operation. We follow the same protocol as [42], using the same hyperparameters for training. We train both networks from scratch, applying BN or WBN layers after each layer with parameters, except the classifier. The implementation details of the domain assignment branch follow those of Section 2.4.5 and we set $\lambda=1$ for all the experiments.

The evaluation was performed using an NVIDIA GeForce GTX 1070 GPU, implementing all the models with the popular Caffe [111] framework. For the baseline AlexNet architecture we take the pre-trained model available in Caffe, while for ResNet we consider the model from [244]. The code implementing the WBN layers is publicly available.⁸

Results on COLD. We first perform experiments on the COLD database, where the goal is to demonstrate the effectiveness of WBN in learning effective classification models under varying environmental conditions (e.g. illumination, laboratories). For each laboratory and illumination condition we consider the standard sequence 1 of part A, except for Saarbrücken Cloudy, for which we take sequence 2 due to known acquisition issues,⁹ and Saarbrücken Sunny, for which we take part B since sunny sequences for part A are not available. We consider the 4 classes shared between the sequences: printer area, corridor, bathroom and office (obtained by merging 1-person and 2-persons office). We report the results as the average accuracy per class. In these experiments we consider both AlexNet and ResNet, comparing WBN with baseline models obtained by adding traditional BN layers to the same architectures.


7 The full dataset was not available at the time we proposed WBN, but the authors provided us with a subset of about 500 images per camera, corresponding to 900 categories.

8 https://github.com/mancinimassimiliano/caffe

9 http://www.cas.kth.se/COLD/bugs.php



Table 2.8. DG accuracy on COLD over different lighting conditions. First row indicates the target sequence, with the first letters denoting the laboratory and the last the illumination condition (C=Cloudy, S=Sunny, N=Night). Vertical lines separate domains of the same laboratory. * indicates the algorithm uses domain knowledge.

Net      Norm.   Fr.C   Fr.N   Fr.S   Lj.C   Lj.N   Lj.S   Sa.C   Sa.N   Sa.S   avg.
AlexNet  BN      97.3   89.1   97.4   92.9   64.4   94.2   75.6   69.7   44.0   80.5
AlexNet  WBN     98.1   91.3   97.1   93.1   65.1   94.1   77.7   68.8   50.2   81.7
AlexNet  WBN*    97.1   91.9   98.0   93.9   65.6   95.0   77.2   69.9   49.9   82.1
ResNet   BN      97.7   82.2   90.7   89.5   61.2   90.3   70.7   73.0   38.7   77.1
ResNet   WBN     98.1   81.8   94.1   94.5   61.7   93.7   75.8   76.9   37.8   79.4
ResNet   WBN*    97.9   81.3   93.4   94.7   65.1   94.6   78.1   76.5   38.5   80.0


We test two different variants of the proposed approach. In the first case (WBN*) we assume the presence of domain priors at training time, as in Section 2.5.2. In the second variant, WBN, we do not assume any knowledge about domains at training time, thus our model relies solely on the soft-assignment. We highlight that WBN with soft-assignment is similar to the mDA layers of Section 2.4 except that (i) no loss is applied on the domain prediction branch and (ii) no target data are available during training, thus no statistics are available for the target and we must rely on the domain prediction branch also at test time.

Firstly, we consider different lighting conditions, i.e. we assume that the domain shift is due to changes of illumination. To this end we train the network on sequences of the same laboratory, training on two lighting conditions (e.g. sunny and cloudy) and testing on the third (e.g. night). The results are reported in Table 2.8.

As expected, when knowledge about domains is available (WBN*), improved classification accuracy can in general be obtained with respect to a domain-agnostic classifier. Interestingly, for both networks the result of WBN without domain priors is either comparable to or surpasses the baseline in almost all settings. This suggests that the network is able to latently discover clusters of samples and effectively use this information for learning robust classification models.

Secondly, we perform an analysis similar to Table 2.8, but considering changes of robotic platform/environment. We keep the lighting condition constant, training on two laboratories and testing on the third. Table 2.9 shows the obtained results. Again, in most cases exploiting domain priors brings benefits in terms of performance, for both networks. The results of Tables 2.8 and 2.9 show that the benefits of our WBN layer, with and without the domain loss, are not limited to a particular type of domain shift (i.e. changes in robots, environment or illumination condition), demonstrating that our approach provides a general and effective strategy to address domain variations. In both experiments, there are a few cases in which standard BN achieves comparable or slightly superior results w.r.t. WBN. A possible reason is that in some situations the ability of our model to generalize to novel settings may be hindered by the small number or by the specific characteristics of the available source domains.



Figure 2.11. Distribution of the values of the weights computed with AlexNet+WBN for the scenario Lj.N as target in Table 2.9. Different colors represent different original source domains.



Figure 2.12. Distribution of accuracy gains of AlexNet+WBN* w.r.t. AlexNet+BN considering Saarbrücken as target, varying both laboratory and illumination. Colors indicate larger (blue), lower (red) and comparable (green) performances.

Table 2.9. DG accuracy on COLD over different environments/sensors. First row indicates the target sequence, with the first letters denoting the laboratory and the last the illumination condition (C=Cloudy, S=Sunny, N=Night). Vertical lines separate domains with same illumination condition. * indicates the algorithm uses domain knowledge.

Net      Norm.   Fr.C   Sa.C   Lj.C   Fr.N   Sa.N   Lj.N   Fr.S   Lj.S   Sa.S   avg.
AlexNet  BN      26.0   38.4   34.4   27.9   26.6   33.1   28.8   34.2   25.1   30.5
AlexNet  WBN     25.8   38.2   33.0   29.4   26.6   34.8   30.3   36.9   25.1   31.1
AlexNet  WBN*    25.9   40.3   33.4   28.0   27.6   34.9   31.5   44.3   28.6   32.7
ResNet   BN      37.9   40.9   39.3   30.8   48.3   41.2   30.6   40.6   27.6   37.5
ResNet   WBN     37.3   39.5   42.6   40.4   51.8   41.0   33.8   39.6   30.8   39.6
ResNet   WBN*    36.6   40.3   40.0   41.2   56.2   45.2   35.4   39.4   25.6   40.0


In order to verify the ability of WBN to discover latent domains, Fig. 2.11 shows the distribution of the values $\hat{w}_{i,j}$ computed for the images of the original source domains associated to one of the experiments in Table 2.9. The plots associated to the other experiments are similar and we do not report them due to lack of space. Since we consider two latent domains in these experiments and $\hat{w}_{i,1}+\hat{w}_{i,2}=1$, we report only the values computed for $\hat{w}_{i,1}$. Different colors represent the original source domains. As the figure shows, the lateral branch computes different assignments for the samples of the different original source domains. As a result, the latent source domains extracted by WBN tend to correspond to the original ones used by WBN*.

In another series of experiments we consider the scenario where both illumination and laboratory change. We performed 27 different experiments, corresponding to the case where Saarbrücken is considered as the target domain. Figure 2.12 reports the histogram of the gains in accuracy of our approach AlexNet+WBN* w.r.t. AlexNet+BN. As shown in Fig. 2.12, in most cases our model leads to an increase in accuracy between 1% and 5%. In only 5 out of 27 experiments does our model not produce benefits.

Comparison with SOTA on VPC. In order to compare our model with state-of-the-art approaches in robotics, we consider the VPC dataset. VPC has been used in previous works to test the DG abilities of different methods. Following the standard experimental protocol of [273], we evaluate our model using 5 houses for training and 1 for test, averaging the results over the 6 configurations. For each house we report the average accuracy per class. Table 2.10 compares the results of our models with baseline deep architectures, with and without traditional BN layers. We consider both the case where domain information is available (WBN*) and where it is not (WBN). Analogously to what was observed in the experiments on the COLD dataset, the accuracy increases when WBN is adopted, both for the AlexNet and ResNet architectures. Interestingly, having domain priors during training produces a performance boost for ResNet, while for AlexNet this is not the case. This suggests that different features have a different impact on our model. Features of the very last layers, as in AlexNet, may not be sufficiently domain-discriminative, especially in case of limited shift within the source domains. In those cases, a soft-assignment can provide a more effective strategy for clustering samples.

Table 2.10. VPC dataset: average accuracy per class.

Net               H1     H2     H3     H4     H5     H6     avg.
AlexNet           49.8   53.4   49.2   64.4   41.0   43.4   50.2
AlexNet + BN      54.5   54.6   55.6   69.7   41.8   45.9   53.7
AlexNet + WBN     54.7   51.9   61.8   70.6   43.9   46.5   54.9
AlexNet + WBN*    53.5   54.6   55.7   68.1   44.3   49.9   54.3
ResNet            55.8   47.4   64.0   69.9   42.8   50.4   55.0
ResNet + WBN      55.7   49.5   64.7   70.2   42.1   52.0   55.7
ResNet + WBN*     56.8   50.9   64.1   69.3   45.1   51.6   56.5

Table 2.11. VPC dataset: comparison with state of the art.

Method    [273]         [65]   [274]   [290]   AlexNet               ResNet
Config.   SIFT   CE     HOUP   CE+BF   BF      -      BN     WBN*   -      WBN*
Acc.      35.0   41.9   45.6   45.9    50.0    50.2   53.7   54.3   55.0   56.5


Finally, Table 2.11 compares the results obtained with WBN with those of state-of-the-art methods. Specifically, we consider the method in [273], where SIFT [158] and CENTRIST (CE) features [274] are provided as input to a nearest neighbor classifier, and the approach in [65], where the same classifier is employed but using Histograms of Oriented Uniform Patterns (HOUP) as input. For the sake of completeness, we also report the results obtained by additionally exploiting the temporal information between images. For this setting, we report the performance of the CENTRIST-based approach of [274] coupled with Bayesian Filtering (BF) and the results of [290], which again used a Bayesian Filter together with object templates. As shown in the Table, applying deep-learning techniques already guarantees an increase in performance of about 4% with respect to the state of the art. Introducing WBN inside the network allows a further accuracy gain.

Experiments on a large-scale scenario: SPED. In this section we show the results obtained when WBN is applied to a large-scale dataset of outdoor scenes, i.e. the SPED dataset. In order to use SPED as a DG benchmark, we split the dataset into two sets, February and August, according to the months of data acquisition. Since no other automatic training data splits are possible using timestamps, in these experiments we do not use domain supervision and only consider WBN with two latent domains. The choice of two domains is motivated by the fact that the dataset contains images collected at different times of the day, and thus we assume that the latent domains automatically discovered by our method correspond to "night" and "day".

Table 2.12. SPED dataset: comparison of different models.

Net                    AMOSNet               AlexNet
Config.                Base   BN     WBN     Base   BN     WBN
February-to-August     83.7   88.8   90.3    83.6   88.9   90.5
August-to-February     71.2   82.7   86.1    73.9   83.1   87.0


Results are shown in Table 2.12. WBN provides a clear gain in all considered settings and for all considered architectures. The improvement of 4% obtained in the "August-to-February" case for both networks is remarkable given the very large number of classes and the lack of domain supervision.

2.5.4 From BN to Classifiers: Best Sources Forward 10


In Section 2.5.2, we discussed how to address DG given a domain classification branch and domain-specific (either latent or explicit) normalization layers. However, the same approach can be applied, in principle, to other parts of the network. In this subsection, we describe how the same methodology can be applied to domain-specific classification layers (Fig. 2.13).

The approach devised in the previous section requires three components: (i) a way to estimate domain membership of a sample both at training and at test time, (ii) a distinction between domain-specific and domain-agnostic network elements, and (iii) a strategy to merge domain-specific activations within the network. The first point can be easily addressed through a domain classifier, as described in Sections 2.5.2 and 2.4.3.

For what concerns the second point, we can write our classification model as $f_C^{\Theta}$, where $\Theta=\{\theta_j\}_{j=1}^{k_s}$ denotes the set of parameters to learn and each $\theta_j$ contains the parameters corresponding to a specific domain $j$. Moreover, let us consider $\theta_j=\{\hat{\theta}_s,\hat{\theta}_j\}$, where $\hat{\theta}_s$ indicates the parameters shared by all domain-specific models and $\hat{\theta}_j$ the domain-specific ones. Under this formulation, in Section 2.5.2, $\hat{\theta}_s$ were all the parameters of the network while $\hat{\theta}_j$ were the domain-specific BN statistics. In this section, we change perspective and assume $\hat{\theta}_s$ to be a feature extractor and $\hat{\theta}_j$ to be domain-specific semantic classification heads. Note that the formulation is general and can be applied to multiple/different levels of the network.

Now that we have defined the domain-specific components, we must define how to merge the activations of the domain-specific layers. During training, the simplest strategy is to rely on the domain label of the sample, namely:

$$f_C^{\Theta}(x_i) = \sum_{j=1}^{k_s} \mathbb{1}_{s_i = s_j}\, f_j(x_i) \tag{2.12}$$


10 M. Mancini, S. Rota Bulò, B. Caputo, E. Ricci. Best sources forward: domain generalization through source-specific nets. IEEE International Conference on Image Processing (ICIP) 2018.


where we write $f_C^{\theta_j}$ as $f_j$ for simplicity. Similarly to Eq. (2.9), this equation cannot be applied at test time, when the domain membership of a sample is unknown and falls outside the space of the available source domains $D^s$. Similarly to what we did in Section 2.5.2, we can use a soft version of Eq. (2.12) at test time:

$$f_C^{\Theta}(x_i) = \sum_{j=1}^{k_s} w_{i,j}\, f_j(x_i; \theta_j) \tag{2.13}$$




Figure 2.13. Intuition behind the proposed BSF framework. Different domain-specific classifiers and the classifiers fusion are learned at training time on source domains, in a single end-to-end trainable architecture. When a target image is processed, our deep model optimally combines the source models in order to compute the final prediction.


where $w_{i,j}$, the $j$-th output of the domain classifier $f_D^{\Theta}(x_i)$, is the probability that sample $i$ belongs to domain $j$. The model is trained with the same semantic and domain classification losses defined in Eq. (2.11).

We highlight that, differently from Sections 2.4 and 2.5.2, here the merging of the domain-specific activations/components is performed after, rather than within, the feature extraction process. Moreover, we found a simple modification to be beneficial in this scenario. In particular, we introduce a hyperparameter $0<\alpha<1$ and re-write the classification function as follows:

$$f_C^{\Theta}(x_i) = (1-\alpha)\sum_{j=1}^{k_s} w_{i,j}\, f_j(x_i) + \frac{\alpha}{k_s}\sum_{j=1}^{k_s} f_j(x_i). \tag{2.14}$$

In practice, $\alpha$ allows merging the domain-specific components by exploiting the similarity among domains with the first term (as in Eq. (2.13)), while the second term considers domain agreement on the predictions, weighting all classifiers equally. This allows the model to be robust to inaccurate domain assignments at test time, while increasing the feedback to domain-specific models for source sets with few samples. In practice, during training we randomly switch with probability $\alpha$ between using the given domain label as $w_{i,j}$ and assigning to all domain-specific classifiers a uniform weight $1/k_s$. At test time, we use Eq. (2.14) with $w_{i,j}$ obtained from the domain prediction branch. As the experiments show, this choice allows us to obtain a more robust final classification model. Figure 2.14 provides an overview of our model.
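The α-blending of Eq. (2.14) amounts to mixing a similarity-weighted combination of the domain-specific heads with their uniform average. A minimal NumPy sketch under hypothetical names (`bsf_combine`, `preds` stacking the outputs of the $k_s$ heads):

```python
import numpy as np

def bsf_combine(preds, w, alpha=0.25):
    """Eq. (2.14) sketch: (1 - alpha) * similarity-weighted mixture of
    the domain-specific predictions plus alpha * their uniform average.

    preds : (k, n, c) outputs of the k domain-specific classifiers f_j
    w     : (n, k) domain-membership weights (rows sum to 1)
    """
    weighted = np.einsum('nk,knc->nc', w, preds)  # first term of Eq. (2.14)
    uniform = preds.mean(axis=0)                  # (1/k) * sum_j f_j(x_i)
    return (1.0 - alpha) * weighted + alpha * uniform

# demo: with alpha = 0 and a one-hot w, the combination reduces
# to the prediction of the single selected domain-specific head
preds = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # k=2 heads, n=1, c=2
w = np.array([[1.0, 0.0]])
out = bsf_combine(preds, w, alpha=0.0)
```

At the other extreme, `alpha=1` ignores the domain weights entirely, which is what makes the model robust to a wrong domain assignment.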



Figure 2.14. Simplified architecture of the proposed BSF framework. The input image is fed to a series of domain-specific classifiers and to the domain prediction branch. The latter produces the assignment $w$, which is fed to the domain prediction loss. The same $w$ is modulated by $\alpha$ before being used to combine the output of each classifier. The final output of the architecture, $z$, is fed to the classification loss.


2.5.5 Experiments: Domain Generalization in Computer Vision


Datasets. We test the performance of BSF on two publicly available benchmarks. The first is rotated-MNIST [79], a dataset composed of different domains obtained by applying different degrees of rotation to images of the original MNIST digits dataset [131]. We follow the experimental protocol of [185], randomly extracting 1000 images per class from the dataset and rotating them by 0, 15, 30, 45, 60 and 75 degrees counterclockwise. As in previous works, we consider one domain as target and the rest as sources.

The second is PACS [133], the same database we used in the latent domain discovery section. Differently from Section 2.4.5, we consider a domain generalization setting (i.e. no target data available during training). Following the experimental protocol of [133], we train our model considering three domains as source datasets and the remaining one as target.

Networks and training protocols. In our evaluation we set the parameters $\alpha=0.25$ and $\lambda=0.5$. For the experiments on the rotated-MNIST dataset, we employ the LeNet architecture [131] following [185]. The network is trained from scratch, using a batch size of 250 with an equal number of samples for each source domain. We train the network for 10000 iterations, using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, momentum 0.9 and weight decay 0.0005. The learning rate is decayed through an inverse schedule, following previous works [77]. For the domain prediction branch, we take the image as input and perform two convolutions, with the same parameters as the first two convolutional layers of the main network. Each convolution is followed by a ReLU non-linearity and a pooling operation. The domain prediction branch follows the implementations of the previous sections. It terminates with a global average pooling followed by a fully connected layer which outputs the final weights. To ensure that $\sum_{j=1}^{N} w_{i,j}=1$, we apply the softmax operator after the fully connected layer.
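The tail of the domain prediction branch (global average pooling, a fully connected layer, then softmax) can be sketched as follows; this is a NumPy illustration with hypothetical parameter names, not the Caffe implementation used in the thesis:

```python
import numpy as np

def domain_branch(feats, W, b):
    """Sketch of the domain-prediction head: global average pooling over
    the spatial dimensions, a fully connected layer, then softmax so the
    outputs form a probability vector over the k source domains.

    feats : (n, c, h, w) convolutional feature maps
    W     : (c, k) fully connected weights; b : (k,) bias (illustrative)
    """
    pooled = feats.mean(axis=(2, 3))              # global average pooling -> (n, c)
    logits = pooled @ W + b                       # (n, k)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return z / z.sum(axis=1, keepdims=True)       # rows sum to 1

# demo on random feature maps: 4 samples, 8 channels, 5x5 spatial, 3 domains
rng = np.random.default_rng(0)
feats = rng.random((4, 8, 5, 5))
W, b = rng.random((8, 3)), np.zeros(3)
w_out = domain_branch(feats, W, b)
```

The softmax at the end is exactly what enforces the constraint $\sum_j w_{i,j}=1$ mentioned above.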

Table 2.13. Rotated-MNIST dataset: comparison with previous methods.

Method        0      15     30     45     60     75     Mean
CAE [221]     72.1   95.3   92.6   81.5   92.7   79.3   85.5
MTAE [79]     82.5   96.3   93.4   78.6   94.2   80.5   87.5
CCSA [185]    84.6   95.6   94.6   82.9   94.8   82.1   89.1
BSF           85.6   95.0   95.6   95.5   95.9   84.3   92.0


For PACS, we trained the standard AlexNet architecture, starting from the ImageNet pre-trained model. We use a batch size of 192, with 64 samples for each source domain. The initial learning rate is set to $5\cdot10^{-4}$ with a weight decay of $10^{-6}$ and a momentum of 0.9. We train the network for 3000 iterations using SGD, decaying the initial learning rate by a factor of 10 after 2500 iterations. For the domain prediction branch, we use the features of pool5 as input, performing a global average pooling followed by a fully-connected layer and a softmax operator which outputs the domain weights.

Our evaluation is performed using an NVIDIA GeForce 1070 GTX GPU, implementing all the models with the popular Caffe [111] framework. For the baseline AlexNet architecture we take the pre-trained model available in Caffe.

Results on Rotated-MNIST. We first test the effectiveness of our model on the rotated-MNIST benchmark. We compare BSF with the CCSA method in [185] and the multi-task autoencoders in [79] (MTAE) and [221] (CAE). The results of the baseline methods are taken directly from [185]. As shown in Table 2.13, BSF outperforms all the baselines. A remarkable gain in accuracy is achieved in the 45° case. We ascribe this gain to the capability of our deep network to assign, for each target image, more importance to the source domains corresponding to the closest orientations, increasing the weights of the associated classifiers. Indeed, since 45° is in the middle of the range of all possible orientations, it is likely that a stronger classifier can be constructed, since we can exploit all the source models appropriately re-weighted. To further verify the effectiveness of our framework and its ability to properly combine source-specific models, we also compute, for target samples with different orientations, the number of assignments to each source domain. In this experiment, a target sample $x_i$ is assigned to a source domain by computing $\arg\max_j w_{i,j}$. The results are shown in Fig. 2.15 (the numbers of assignments are normalized for each row). The figure clearly shows that the proposed domain prediction branch tends to associate a target sample with the source domains corresponding to the closest orientations. Consequently, our deep network classifies target samples by constructing a model from the most related source classifiers. This results in more accurate predictions than previous domain-agnostic models, due to the specialization of the source classifiers on specific orientations.
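The assignment analysis behind Fig. 2.15 boils down to a hard argmax over the predicted weights followed by row-normalized counting (a sketch with invented toy weights; in the actual experiment the weights come from the domain prediction branch):

```python
import numpy as np

def assignment_matrix(weights, target_domain_ids, n_sources):
    """Hard-assign each target sample to the source domain with the
    largest predicted weight, then count assignments per target
    orientation and normalize each row to sum to 1.
    weights: (n_samples, n_sources); target_domain_ids: (n_samples,)."""
    hard = weights.argmax(axis=1)                  # argmax_j w_{i,j}
    n_targets = target_domain_ids.max() + 1
    counts = np.zeros((n_targets, n_sources))
    for t, s in zip(target_domain_ids, hard):
        counts[t, s] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# toy example: 4 samples from 2 target orientations, 3 source domains
w = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7],
              [0.2, 0.1, 0.7]])
M = assignment_matrix(w, np.array([0, 0, 1, 1]), 3)
```

Each row of `M` then reads as the fraction of samples from one target orientation assigned to each source domain, exactly the quantity visualized in the figure.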



Figure 2.15. Rotated-MNIST dataset: analysis of the assignments computed by the domain prediction branch.


Results on PACS. We also perform experiments on the PACS dataset. We compare BSF both with previous approaches using precomputed features (in this case DECAF-6 features [59]) as input and with end-to-end trainable deep models. For the baselines with pre-computed features, we report the results of MTAE [79], low-rank exemplar SVMs (LRE-SVM) [288] and uDICA [186], while for the end-to-end trainable deep models, we report the results of the domain-agnostic model coupled with the tensor factorization of [133] (TF-CNN) and the meta-learning approach MLDG [144]. For a fair comparison, the deep models [133, 144] and our network are all based on the same architecture, i.e. AlexNet. Table 2.14 shows the results of our comparison. The performances of previous methods are taken directly from the respective papers [133, 144]. For our approach and [133] we also report results obtained without fine-tuning. Our model outperforms all previous methods. These results are remarkable because, differently from the rotated-MNIST dataset, in PACS the domain shift is significant and does not originate from simple image perturbations. Therefore, the association between a target sample and the given source domains is more subtle to capture. For the sake of completeness, we also report the performance obtained with the standard AlexNet network. These results show that state-of-the-art deep models have excellent generalization abilities, typically outperforming shallow models. However, designing deep networks that specifically address the DG problem, as we do, leads to higher accuracy.

We also perform a sensitivity analysis to study the impact of the parameter $\alpha$ on the performance and to demonstrate the benefit of adding a domain-agnostic classifier. We consider the proposed approach without fine-tuning. As shown in Table 2.15, considering only the source-specific classifiers ($\alpha=0$) leads, on average, to the best performances, surpassing in the majority of the cases the domain-agnostic classifier obtained by setting $\alpha=1$. This confirms our original intuition that addressing DG by fusing multiple source models is an effective strategy. However, there are a few situations where using only source models can lead to a decrease in accuracy (e.g. in the Cartoon setting), and incorporating a domain-agnostic component, even with a reduced weight such as $\alpha=0.25$, improves generalization accuracy.
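Since only the fusion coefficient varies in this analysis, the sweep amounts to a convex combination between the two extremes (a minimal sketch with invented toy predictions; $\alpha=0$ recovers the pure source mixture and $\alpha=1$ the domain-agnostic classifier):

```python
import numpy as np

def blend(source_mix, agnostic, alpha):
    """alpha = 0 -> only the (weighted) source-specific classifiers;
    alpha = 1 -> only the domain-agnostic classifier."""
    return (1.0 - alpha) * source_mix + alpha * agnostic

source_mix = np.array([0.8, 0.1, 0.1])   # toy fused source-specific prediction
agnostic   = np.array([0.4, 0.4, 0.2])   # toy domain-agnostic prediction
sweep = {a: blend(source_mix, agnostic, a) for a in (0, 0.25, 0.5, 0.75, 1)}
```

The intermediate values of $\alpha$ interpolate linearly between the two predictions, which is why a small $\alpha$ can recover accuracy in settings where the source mixture alone fails.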

Table 2.14. PACS dataset: comparison with previous methods.

Model                 Art    Cartoon  Photo  Sketch  Mean
MTAE [79]             60.3   58.7     91.1   47.9    64.5
LRE-SVM [288]         59.7   52.8     85.5   37.9    59.0
uDICA [186]           64.6   64.5     91.8   51.1    68.0
TF-CNN [133] (no ft)  62.7   52.7     88.8   52.2    64.1
TF-CNN [133]          62.9   67.0     89.5   57.5    69.2
MLDG [144]            66.2   66.9     88.0   59.0    70.0
BSF (no ft)           64.1   60.6     90.4   49.4    66.1
BSF                   64.1   66.8     90.2   60.1    70.3
AlexNet [133]         63.3   63.1     87.7   54.1    67.1

Table 2.15. PACS dataset: sensitivity analysis.

α      Art    Cartoon  Photo  Sketch
0      65.2   54.5     90.7   52.4
0.25   64.1   60.6     90.4   49.4
0.5    63.8   61.0     90.4   49.1
0.75   64.0   60.9     90.5   47.8
1      63.0   60.1     90.5   47.5


2.5.6 Conclusions


In this section, we presented two deep learning models for addressing DG. The first, WBN, exploits a weighted formulation of BN to learn robust classifiers that can be applied to previously unseen target domains. We showed how this approach is effective in the context of semantic place categorization in robotics, achieving state-of-the-art performance on the VPC benchmark. The effectiveness of WBN is also confirmed by experiments on a large scale dataset of outdoor scenes.

The second, BSF, addresses the problem of DG by exploiting multiple domain-specific classifiers. In particular, it extends the principles of WBN with a domain prediction branch that chooses the optimal combination of source classifiers to use at test time, based on the similarity between the input image and the samples from the source domains. Differently from WBN, it goes beyond domain-specific BN layers, exploring domain-specific classification modules. Moreover, a domain-agnostic component is also introduced in our framework, further improving the performance of the method. Experiments demonstrate the effectiveness of BSF, which outperformed the state-of-the-art models on two benchmarks (at the time of submission).

With WBN and BSF, we have merged domain-specific models either at the level of the BN layers or at that of the classifiers, due to the ease of linearly combining their parameters/statistics (WBN) and predictions (BSF). In future works, it would be interesting to blend domain-specific models at different levels of the network, as explored in other works in contexts such as multi-task learning [181], life-long learning [4] and motion control [306, 302].
A drawback of both WBN and BSF is the assumption that multiple and diverse source domains are available at training time. This may not always be possible, due to costly or even unfeasible data collection processes. Other recent approaches overcome this issue by considering external sources of knowledge, such as automatically-generated training data [176] and online annotators [247]. Generating synthetic data for the target task could be a huge advantage for training deep models, but requires knowledge of the target task beforehand, something that is not assumed by our model. A possible solution to this issue consists of endowing the robot with the ability to access, on demand, additional information about the target data. Indeed, the generality of our framework allows the integration of external sources of knowledge (e.g. generating multiple domains through web queries or synthetic data). Finally, a major drawback of DG models is the need for multiple labeled source domains during training. In the next sections, we will show how we can drop the assumption of having multiple source domains by extending the DA-layers models to the Continuous DA and Predictive DA scenarios.

2.6 Continuous Domain Adaptation 11


Despite the remarkable performances achieved by DA algorithms in computer vision [142, 28], and their growing popularity in robot vision [7], they require images from the target domain to be available in advance, during training. This is a huge limitation, especially in robotics, due to the likely unpredictable conditions of the environment in which a robot will be employed. In the previous section, we have seen how we can sidestep the need for target data during training when we are given a set of multiple labeled source domains, addressing the DG scenario. However, this setting also has a limitation: the need of collecting (and labeling) data from multiple source domains. In this section, we want to overcome this issue, performing adaptation given just a single source domain during training, without any target domain data. This setting, Continuous DA [103], requires coping with the domain shift directly at test time, as the model processes data of the target domain.

Here, we consider a realistic application scenario for Continuous DA algorithms: the task of kitting. This task is the process of grouping related parts, such as gathering the components of a personal computer (PC) into one bin for assembly [12]. The kitting task requires recognizing the parts in the environment, picking objects from the bins, and placing them at the correct location [105]. All of these subtasks are very challenging on their own, but the recognition of the parts is crucial for the robot to sequentially perform the other subtasks. Already in today's factory settings, object recognition tasks pose challenges such as environmental effects (illumination, viewpoint, etc.), varying object material properties, and cluttered scenes [149]. In order to simplify the recognition task, some approaches use machine vision in rather isolated settings to decrease the environmental variability [236]. Liu et al. [149] proposed a specially designed camera system and an estimation procedure based on 3D CAD models to robustly detect and verify the type and the pose of the object. Kaiba et al. [112] proposed an interactive method where a remote human operator resolves ambiguities in the perception system. Unfortunately, none of the above methods is generic enough to be applied in a truly unconstrained setting. In this section, we are primarily concerned with solving the object recognition problem for kitting using vision in the wild, i.e. in non-isolated settings exhibiting large variations. Right now, most of the robots in the manufacturing industry operate in isolation, primarily because of safety concerns. However, many future scenarios have robots and humans working closer together, moving robots into new areas of application, beyond mass production and preprogrammed behavior. For this to happen, not only safety but also perception will be a major challenge.

In this section, we describe two main contributions. The first is a kitting dataset that contains images of objects taken under varying illumination, viewpoint, and background conditions from a robotic platform. This dataset provides the community with a novel tool for studying the robustness of robot vision algorithms to drastic changes in the appearance of the input images and assess progress in the field. We are not aware of existing, publicly available kitting databases covering this range of visual variability.


11 M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, B. Caputo. Kitting in the Wild through Online Domain Adaptation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.





Figure 2.16. Our ONDA approach for performing kitting in arbitrary conditions. Given a training set, we can train a robot vision model offline. As the robot performs the task, we gradually adapt the visual model to the current working conditions, in an online fashion and without requiring target data during the offline training phase.


Second, we describe an approach for achieving online adaptation of a deep model [166]. Differently from classical DA approaches, this algorithm can adapt a deep model to any target domain on the fly, without requiring any target domain data beforehand. We benchmark the performance of our algorithm on the proposed dataset, showing that this model is able to produce large improvements in target domain performance compared to the base architecture trained on the source domain, matching what would be achieved by having all the target data available beforehand.

2.6.1 The KTH Handtool Dataset


The KTH Handtool Dataset 12 was collected for evaluating the object recognition/detection performance of robot vision methods under varying viewpoint, illumination and background settings, all crucial abilities for robot kitting in unconstrained, real-world settings. Instead of general household objects, the dataset consists of hand tools, in order to represent a workshop setting in a factory. It contains 9 different hand tools from 3 different categories: hammer, plier and screwdriver. The images were collected with the 2-arm stationary robot platform shown in Fig. 2.17. The dataset covers 3 different illuminations, 2 different cameras (one Kinect camera and one webcam) with different viewpoints and 2 different background settings, corresponding to 12 (3×2×2) domains in total. For each hand tool, approximately 40 images with different poses were collected for each camera and domain setting. Table 2.16 shows example images from different domains. In total, approximately 4500 RGB images are available in the dataset.


12 https://www.nada.kth.se/cas/data/handtool/




Figure 2.17. The 2-arm stationary robot platform.


2.6.2 Problem Formulation


Suppose we collected a set of images using a robotic platform with the aim of training a robot vision model with it. Since the image collection has been acquired in the real world, the resulting model will be biased towards the particular conditions (e.g. illumination, environment) under which the images have been acquired. Because of this, if we deploy such a system and the current working conditions differ from those of the training set, the performance of the model will degrade due to the substantial shift between training and test data. In this situation, to increase the generalization capabilities of the robot, we can remove the acquisition bias either by collecting more training data in a large variety of conditions, which is extremely expensive, or by developing algorithms able to bridge the gap between training and test data, aligning the original model to the novel scenario. The latter is the goal of domain adaptation. Formally, we assume to have a source domain $S=\{x_i^s, y_i^s\}_{i=1}^{n}$, where $x_i^s$ is an image and $y_i^s \in \{1,\ldots,C\}$ the associated semantic label. As opposed to traditional domain adaptation in a batch setting, during training we only have access to the source domain $S$ and we do not have any data or prior information about the target domain $T$, apart from the set of semantic labels, which is assumed to be shared. When the robot is active, the current working conditions will compose the target domain and we will have access to the automatically acquired sequence of images $T=\{x_1^t,\ldots,x_T^t\}$. In this scenario, in order to adapt the network parameters $\theta$ to this novel domain, we must exploit the incoming test images collected by the robot on the fly.

Table 2.16. Example Images from KTH Handtool Dataset

Camera Type   Illumination
              Artificial   Cloudy   Directed
Kinect        [image]      [image]  [image]
Webcam        [image]      [image]  [image]


2.6.3 ONDA: ONline Domain Adaptation with Batch-Normalization


Starting from the idea of Domain Alignment Layers (Section 2.3), we can follow the same principle of obtaining a target-specific model but considering an online setting. In particular, instead of having a fixed target set available during training, we propose to exploit the stream of data acquired while the robot is acting in the environment and continuously update the BN statistics. In this way, we can gradually adapt the deep network to a novel scenario.

Specifically, we start by training the network on the source domain $S$, initializing the BN statistics at time $t=0$ as $\{\mu_0, \sigma_0^2\} = \{\mu_S, \sigma_S^2\}$. Assuming that the set of network parameters $\theta$ is shared between the source and target domains except for the BN statistics, we can adapt the network classifier $f_\theta$ by updating the BN statistics with the estimates computed from the sequence $T$. Defining $n_t$ as the number of target images used to update the BN statistics online, we can compute a partial estimate $\{\hat{\mu}_t, \hat{\sigma}_t^2\}$ of the BN statistics as:

$$\hat{\mu}_t = \frac{1}{n_t}\sum_{i=1}^{n_t} x_i^t, \qquad \hat{\sigma}_t^2 = \frac{1}{n_t}\sum_{i=1}^{n_t}\left(x_i^t - \hat{\mu}_t\right)^2$$

where $x_i^t$ are samples from the distribution of activations of a given feature channel for domain $t$, following the notation in Section 2.3. The global statistics at time $t$ can then be updated as follows:

$$\sigma_t^2 = (1-\alpha)\,\sigma_{t-1}^2 + \alpha\,\frac{n_t}{n_t - 1}\,\hat{\sigma}_t^2$$

$$\mu_t = (1-\alpha)\,\mu_{t-1} + \alpha\,\hat{\mu}_t$$

where $\alpha$ is the hyperparameter regulating the decay of the moving average.
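The update rule above can be summarized as a small stateful module (a NumPy sketch under simplified assumptions: per-channel statistics stored outside the network, a synthetic target stream; in practice the estimates live inside each BN layer):

```python
import numpy as np

class OnlineBNStats:
    """Online estimate of per-channel BN statistics, as in ONDA.
    Initialized with the source-domain statistics (mu_S, sigma2_S);
    each call to update() consumes a small batch of n_t activations."""

    def __init__(self, mu_source, var_source, alpha=0.1):
        self.mu = np.asarray(mu_source, dtype=float)
        self.var = np.asarray(var_source, dtype=float)
        self.alpha = alpha

    def update(self, x):
        """x: (n_t, channels) activations from the target stream."""
        n_t = x.shape[0]
        mu_hat = x.mean(axis=0)                     # partial mean
        var_hat = ((x - mu_hat) ** 2).mean(axis=0)  # partial (biased) variance
        # moving-average update with the Bessel correction n_t / (n_t - 1)
        self.var = (1 - self.alpha) * self.var \
                   + self.alpha * n_t / (n_t - 1) * var_hat
        self.mu = (1 - self.alpha) * self.mu + self.alpha * mu_hat

    def normalize(self, x, eps=1e-5):
        return (x - self.mu) / np.sqrt(self.var + eps)

# initialize from (hypothetical) source statistics, then adapt online
stats = OnlineBNStats(mu_source=[0.0], var_source=[1.0], alpha=0.1)
rng = np.random.default_rng(0)
for _ in range(200):                  # stream of target batches, n_t = 10
    stats.update(rng.normal(loc=3.0, scale=2.0, size=(10, 1)))
```

After a couple hundred small batches, the estimates drift from the source initialization towards the target statistics (here mean 3 and variance 4), which is exactly the gradual adaptation effect exploited by ONDA.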

The above formulation achieves a similar adaptation effect to that of DAL layers [142, 29, 28], but with three main advantages. First, no samples of the target domain, neither labeled nor unlabeled, are used during training. Thus, no further data acquisition and annotation efforts are required. Second, since we do not exploit target data for training, contrary to standard DA algorithms, we have no bias towards a particular target domain. Third, since the adaptation process is online, the model can adapt itself to multiple sequential changes of the working conditions, being able to tackle unexpected environmental variations (e.g. sudden illumination changes).



Figure 2.18. The statistics of the BN layers are initialized offline, by training the network on the images of the source domain. At deployment time, the input frames are processed using the global estimate of the statistics (red lines). However, the robot collects batches of $n_t$ input frames to compute partial BN statistics, using these estimates to gradually update the BN statistics for the current scenario.


The reader might wonder whether other choices could be considered for initializing $\{\mu_0, \sigma_0^2\}$, such as exploiting a first calibration phase in which the robot collects images of the target domain to produce an initial estimate of the BN statistics. Here we choose to use the statistics estimated on the source domain because 1) we want a model that is ready to be employed, without requiring any additional preparation at test time; 2) the robot may encounter multiple domains during deployment, and if a shift occurs (e.g. the illumination conditions change) our method will automatically adapt the visual model to the novel domain starting from the currently estimated statistics: initializing $\{\mu_0, \sigma_0^2\} = \{\mu_S, \sigma_S^2\}$ allows checking the performance of the algorithm even under multiple sequential shifts and in long-term applications. Obviously, our method could benefit from a calibration phase producing statistics closer to the target working conditions: we plan to analyze these aspects in the future.

2.6.4 Experimental results


Networks and training protocols. We perform our experiments with the AlexNet [124] architecture pre-trained on ImageNet [52]. We train 3 additional models: a variant of AlexNet with BN, the DA architecture DIAL from [29] and our ONline DA model (ONDA). Following [29], we add BN layers or their variants after each fully-connected layer. The standard AlexNet, AlexNet with BN and DIAL are all trained with a batch size of 128. We implemented [29] by splitting the batch between images of the source and target domains proportionally to the number of images in each set, as in [29], without employing the entropy loss for target images [29, 28]. We highlight that DIAL is our upper bound in this case, since it shares the same philosophy as ONDA but assumes that images of the target domain are available at training time.
As preprocessing, we rescale all the images to ensure a shortest side of 256 pixels, preserving the aspect ratio and subtracting the per-channel mean value computed over the ImageNet database. As input to the network we use a random crop of $227\times 227$ at training time, employing a central crop of the same dimensions during test. No additional data augmentation is performed. For all the variants of the architecture, we fine-tune the last layers for 30 epochs with an initial learning rate of 0.001 for fc6 and fc7 and of 0.01 for the classifier, with a weight decay of 0.0005 and momentum of 0.9. We scale the initial learning rates by a factor of 0.1 after 25 epochs.
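The resizing and cropping arithmetic can be sketched as follows (pure bookkeeping with illustrative helper names; the actual pipeline additionally subtracts the ImageNet per-channel means):

```python
def resize_dims(w, h, shortest=256):
    """New (width, height) such that the shortest side equals
    `shortest`, preserving the aspect ratio."""
    scale = shortest / min(w, h)
    return round(w * scale), round(h * scale)

def center_crop_box(w, h, crop=227):
    """Top-left corner and size of a central `crop` x `crop` window."""
    left = (w - crop) // 2
    top = (h - crop) // 2
    return left, top, crop, crop

# a 640x480 frame becomes 341x256, then a centered 227x227 crop
w, h = resize_dims(640, 480)
box = center_crop_box(w, h)
```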

In order to apply ONDA, we start from the weights of AlexNet with BN trained on the given source domain. Then, we perform one pass over the target domain, without updating any parameter other than the BN statistics. As a trade-off between the stability of the statistics and the speed of adaptation, we set $n_t=10$ and $\alpha=0.1$. We detail the impact of these choices in the following subsections.

In all the experiments, we consider the task of object recognition in the fine-grained setting, with all 9 classes as the classification objective. We report the average accuracy over 5 runs, shuffling the order of the input images in each run of our model.

Domain Adaptation results


In this subsection, we present the results of our algorithm. In order to analyze the effect that each possible change has on the adaptation capabilities of our model, we isolate the sources of shift. To this end, we consider two sample starting source domains: in the first case (Figure 2.19a), the acquisition conditions are artificial light, Kinect camera and white background; in the second case we consider cloudy illumination, webcam and brown background (Figure 2.19b). From these source domains we start by changing only one of the acquisition conditions (left part of the figures), gradually increasing the number of changes to 2 and 3 conditions (middle and right parts, respectively). We report the results for our model after processing 25%, 50% and 90% of the target data.

As the figures show, our model is able to fill the gap between the BN baseline (red bars) and the DA upper bound DIAL (green bars) in almost all settings. Only in a few cases, where the gap between the performances of BN and DIAL is smaller, does this not happen (i.e. Figure 2.19a, targets artificial-Kinect-brown and directed-Kinect-white). In all the other settings the gains are remarkable: considering both figures, the average difference between the performances of BN and ONDA-90 is 15%, 18% and 20% for the single, double and triple shift cases, respectively. We stress that the gain increases with the amount of shift between the source and target domains, underlining the importance of applying DA methods in changing environments. As expected, the statistics computed in the first stages (i.e. ONDA-25) are not always sufficiently representative of the true estimate, since they may still be biased by the statistics computed over the source domain. However, the estimate becomes more precise as more images of the target domain are processed (i.e. ONDA-50 and ONDA-90), gradually closing the gap with the estimate computed by DIAL. The speed of adaptation and the quality of the estimates depend on the two hyperparameters $\alpha$ and $n_t$. In the next subsection we analyze their impact on the final performance of the algorithm.



Figure 2.19. Experiments on isolated shifts. The labels of the x-axes denote the conditions of the target domain, with the first line indicating the light condition, the second the camera and the third the background. We underline the changes between the source and target domains.


Ablation study. In this subsection we analyze the impact of the two hyperparameters, the update frequency nt and the decay α, on the number of images needed by ONDA to estimate the statistics for the target domain. We use a sample scenario from Figure 2.19b, where cloudy illumination, webcam camera and brown background are the source domain conditions, and artificial light, Kinect camera and white background are the target domain ones. In the first experiment, we fix nt to 10, varying the value of α. We start from a single AlexNet model with BN pre-trained on the source, repeat the experiment for 5 runs, shuffling the order of the input data, and report the average accuracy for each update step.

Results are shown in Figure 2.20a: increasing the value of α to 0.2 (green line) or 0.5 (black line) allows the model to adapt faster to the target conditions, with the drawback of a noisier estimation of the statistics. Thus, increasing α leads to an unstable convergence of the performance. On the other hand, choosing too low values of α (e.g. 0.05 or 0.01, purple and gold lines respectively) allows a more stable convergence of the model, but with the drawback of a slower adaptation to the novel conditions.

Regarding the hyperparameter nt, we follow the same protocol as in the first experiment, fixing α to 0.1 and varying the number of images collected before updating the statistics, nt, reporting how the accuracy changes with respect to the number of frames processed. As Figure 2.20b shows, low values of nt (e.g. nt = 2) allow a faster adaptation, due to the higher update frequency, but at the price of a noisier estimation of the statistics, which is harmful to the final accuracy achieved by the model. At the same time, high values of nt (e.g. 20, 30) allow for a more precise estimate of the statistics, highlighted by the smoothness of the respective lines in the graph, with the drawback of a lower speed of adaptation to the novel domain, caused by the lower update frequency.
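The interplay between α and nt described above can be sketched with a toy simulation of the running-statistics update (a minimal illustration on synthetic Gaussian activations, not the actual ONDA implementation; inside the network the same update is applied per BN layer):

```python
import numpy as np

def onda_update(global_mean, global_var, batch, alpha):
    """One ONDA step: blend the statistics of the last n_t target
    samples into the global estimate with decay alpha."""
    new_mean = (1 - alpha) * global_mean + alpha * batch.mean(axis=0)
    new_var = (1 - alpha) * global_var + alpha * batch.var(axis=0)
    return new_mean, new_var

# Source statistics (mean 0, var 1) vs. a target stream whose true
# activation mean is 5.0: the estimate drifts towards the target.
rng = np.random.default_rng(0)
mean, var = np.zeros(4), np.ones(4)
n_t, alpha = 10, 0.1                      # update frequency and decay
stream = rng.normal(loc=5.0, scale=1.0, size=(200, 4))
for i in range(0, len(stream), n_t):      # one update every n_t frames
    mean, var = onda_update(mean, var, stream[i:i + n_t], alpha)
```

Raising alpha (or lowering n_t) makes the drift faster but noisier, matching the behaviour observed in Figure 2.20.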



Figure 2.20. Accuracy vs number of updates of ONDA for different values of (a) α and (b) nt in a sample scenario. The red line denotes the BN lower bound of the starting model, while the yellow line the DIAL upper bound.


The speed of adaptation and the final quality of the BN statistics are a direct consequence of the values chosen for both hyperparameters. Moreover, α and nt are not independent from each other: for a lower nt, a lower α should be selected in order to preserve the final performance of the algorithm; conversely, for a higher nt, a higher α will allow a faster adaptation of the model. As a trade-off between fast adaptation and good results, we found experimentally that choosing nt ∈ {5, 10, 20} and α ∈ {0.05, 0.1} worked well for both short- and long-term experiments.

2.6.5 Conclusions


In this section, we presented a novel dataset for addressing the kitting task in robotics. The dataset takes into account multiple variations of the acquisition conditions, such as camera, illumination and background changes, which may occur during robot deployment. This dataset is intended for testing the robustness of robot vision algorithms to changing environments, providing a novel benchmark for assessing robot vision systems.

Additionally, we described ONDA, an algorithm capable of performing online adaptation of deep models to any unseen visual domain. The algorithm, based on updating the statistics of batch-normalization layers, can continuously adapt the model to the current environmental conditions of the robot, providing more robustness to unexpected working conditions. Experiments on the newly proposed dataset confirm the ability of ONDA to fill the gap between a standard architecture, trained only on source data, and its domain-adapted counterpart, without requiring any additional target data during training.

It is worth highlighting that, despite its effectiveness and the fact that it requires a single source domain (differently from the DG approaches in Section 2.5), the method has two main drawbacks. Since it adapts to the stream of target samples, its adaptation is gradual and it cannot work under abrupt changes of the input distribution. As a consequence, it can only address one target domain shift at a time, contrary to DG approaches, which build a single model for multiple target domains. In the next section, we show how to merge the benefits of DG and Continuous DA, proposing the first deep model for the task of Predictive DA.
Finally, as future work, we plan to enlarge the dataset, including more sources of variation and more objects. We further plan to provide a deeper analysis of our algorithm with more architectures, as well as to explore possible extensions that could exploit knowledge coming from previously encountered scenarios.

2.7 Predictive Domain Adaptation 13


An underlying common thread linking the sections of this chapter is the importance of being able to overcome the domain shift problem even under incomplete (Section 2.4) or absent (Sections 2.5 and 2.6) information about the target domain during training. In particular, although it might be reasonable for some applications to have target samples available during training, in most cases data collection and labeling might be too costly (e.g. robotics) or even unfeasible (e.g. hazardous environments). Therefore, we argued that it is important to build models able to perform domain adaptation even without target data at training time.

For this reason, in Sections 2.5 and 2.6 we focused on scenarios that do not assume the presence of target data during training, namely DG and Continuous DA. In both scenarios, different information is used to overcome the domain shift. In the first, DG, the presence of multiple labeled source domains allows us to build models disentangling domain-specific and semantic-specific information, possibly generalizing to unseen input distributions. In the second, Continuous DA, target data received at test time are used to gradually update the model. Both scenarios have some inherent drawbacks. In DG, we require the presence of multiple labeled source domains, something that might be hard to obtain. In Continuous DA, instead, the model gradually adapts to the target distribution and, consequently, (i) it cannot work under abrupt changes of domains and (ii) it can address only one target domain shift at a time.

In this section, we want to take a step forward by (i) dropping the assumption of having multiple labeled source domains (as opposed to DG) and (ii) adding the possibility to rapidly adapt the model to multiple target domains (as opposed to Continuous DA). Following this idea, previous studies proposed the Predictive Domain Adaptation (PDA) scenario [293], where neither the data nor the labels from the target are available during training. Only annotated source samples are available, together with additional information from a set of auxiliary domains, in the form of unlabeled samples and associated metadata (e.g. corresponding to the image timestamp or camera pose, etc.).

In this section we describe AdaGraph [165], a deep architecture for PDA. As in the works presented in the previous sections, we learn a set of domain-specific models by considering a common backbone network with domain-specific alignment layers embedded into it [28, 29, 142]. However, differently from the previous works, we propose to exploit metadata and auxiliary samples by building a graph which explicitly describes the dependencies among domains. Within the graph, nodes represent domains, while edges encode the relations between domains, imposed by their metadata. Thanks to this construction, when metadata for the target domain are available at test time, the domain-specific model can be recovered. We further exploit target data directly at test time by devising an approach for continuously updating the deep network parameters once target samples are made available (Figure 2.21). We demonstrate the effectiveness of our method with experiments on three datasets: Comprehensive Cars (CompCars) [292], Century of Portraits [82] and CarEvolution [218], showing that our method outperforms state-of-the-art PDA approaches. Finally, we show that the proposed approach for continuously updating the network parameters can be used for continuous domain adaptation, producing more accurate predictions than previous methods [103, 139].


13 M. Mancini,S. Rota Bulò,B. Caputo,E. Ricci. AdaGraph: Unifying Predictive and Continuous Domain Adaptation through Graphs. IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR) 2019.





Figure 2.21. Predictive Domain Adaptation. During training we have access to a labeled source domain (yellow block) and a set of unlabeled auxiliary domains (blue blocks), all with associated metadata. At test time, given the metadata corresponding to the unknown target domain, we predict the parameters associated with the target model. This predicted model is further refined during test, while continuously receiving data of the target domain.


To summarize, the contributions presented in this section are: (i) the first deep architecture for addressing the problem of PDA; (ii) a strategy for injecting metadata information within a deep network architecture by encoding the relation between different domains through a graph; (iii) a simple strategy for refining the predicted target model which exploits the incoming stream of target data directly at test time.

2.7.1 Problem Formulation


Our goal is to produce a model able to accomplish a task in a target domain T for which no data, neither labeled nor unlabeled, is available during training. The only information we can exploit is a characterization of the content of the target domain in the form of metadata mt, plus a set of known domains K, each of them having associated metadata. All the domains in K carry information about the task we want to accomplish in the target domain. In particular, since in this work we focus on classification tasks, we assume that images from the domains in K and T can be classified with semantic labels from the same set Y. As opposed to standard DA scenarios, the target domain T does not necessarily belong to the set of known domains K. Also, we assume that K can be partitioned into a labeled source domain S and N unlabeled auxiliary domains A = {A1, …, AN}.
Specifically, in this section we focus on the predictive DA (PDA) problem, aimed at regressing the target model parameters using data from the domains in K. We achieve this objective by (i) interconnecting the domains in K using the given domain metadata; (ii) building domain-specific models from the data available in each domain in K; (iii) exploiting the connection between the target domain and the domains in K, inferred from the respective metadata, to regress the model for T.

A schematic representation of the method is shown in Figure 2.22. We propose to use a graph because of its seamless ability to encode relationships within a set of elements (domains, in our case). Moreover, it can be easily manipulated to include novel elements (such as the target domain T).

2.7.2 AdaGraph: Graph-based Predictive DA


We model the dependencies between the various domains by instantiating a graph composed of nodes and edges. Each node represents a different domain and each edge measures the relatedness of two domains. Each edge of the graph is weighted, and the strength of the connection is computed as a function of the domain-specific metadata. At the same time, in order to extract one model for each available domain, we employ recent advances in domain adaptation involving the use of domain-specific batch-normalization layers [141,29] . With the domain-specific models and the graph we are able to predict the parameters for a novel domain that lacks data by simply (i) instantiating a new node in the graph and (ii) propagating the parameters from nearby nodes, exploiting the graph connections.

Connecting domains through a graph. Let us denote the space of domains as D and the space of metadata as M. As stated in Section 2.7.1, in the PDA scenario we have a set of known domains K = {k1, …, kn} ⊂ D and a bijective mapping ϕ: D → M relating domains and metadata. For simplicity, we regard as unknown any metadata m that is not associated with a domain in K, i.e. such that ϕ⁻¹(m) ∉ K.

Here we structure the domains as a graph G = (V, E), where V ⊆ D represents the set of vertices corresponding to domains and E ⊆ V × V the set of edges, i.e. the relations between domains. Initially the graph contains only the known domains, so V = K. In addition, we define an edge weight ω: E → ℝ that measures the relation strength between two domains (v1, v2) ∈ E by computing a distance between the respective metadata, i.e.

$$\omega(v_1, v_2) = e^{-d(\phi(v_1), \phi(v_2))}, \tag{2.15}$$

where d: M × M → ℝ is a distance function on M.
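With metadata encoded as numeric vectors, Eq. (2.15) could be computed as follows (a sketch assuming d is the Euclidean distance between metadata vectors; the actual choice of d depends on the metadata at hand):

```python
import math

def edge_weight(meta_a, meta_b):
    """omega(v1, v2) = exp(-d(phi(v1), phi(v2)))."""
    return math.exp(-math.dist(meta_a, meta_b))

# Metadata as (year, viewpoint) pairs, as in the CompCars setup.
w_near = edge_weight((2010, 1), (2011, 1))  # neighbouring years
w_far = edge_weight((2010, 1), (2014, 1))   # distant years
assert w_near > w_far  # closer metadata -> stronger connection
```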

Let Θ be the space of possible model parameters and assume we have properly exploited the data from each domain k ∈ K to learn a set of domain-specific models (we will detail this procedure in the next subsection). We can then define a mapping ψ: K → Θ, relating each domain to its set of domain-specific parameters. Given some metadata m ∈ M, we can recover an associated set of parameters via the mapping ψ ∘ ϕ⁻¹(m), provided that ϕ⁻¹(m) ∈ K. In order to deal with metadata that is unknown, we introduce the concept of virtual node. Basically, a virtual node ṽ is a domain for which no data is available but for which we have associated metadata m̃, namely m̃ = ϕ(ṽ). For simplicity, let us directly consider the target domain T. We have T ∈ D and we know ϕ(T) = mt. Since no data of T is available, we have no parameters that can be directly assigned to the domain. However, we can estimate parameters for T by using the domain graph G. Indeed, we can relate T to other domains v ∈ V using ω(T, v), defined in (2.15), by suitably extending E with new edges (T, v) for all or some v ∈ V (e.g. we could connect all v that satisfy ω(T, v) > τ for some threshold τ). The extended graph G′ = (V ∪ {T}, E′), with the additional node T and the new edge set E′, can then be exploited to estimate parameters for T by propagating the model parameters from nearby domains. Formally, we regress the parameters θ̂T through the formula
$$\hat{\theta}_T = \psi(T) = \frac{\sum_{(T,v)\in E'} \omega(T,v)\,\psi(v)}{\sum_{(T,v)\in E'} \omega(T,v)}, \tag{2.16}$$

where we normalize the contribution of each edge by the sum of the weights of the edges connecting node T . With this formula we are able to provide model parameters for the target domain T and,in general,for any unknown domain by just exploiting the corresponding metadata.
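Eq. (2.16) is a normalized weighted average of the parameters of the nodes connected to the virtual target node, e.g. (the weights and parameter vectors below are hypothetical):

```python
import numpy as np

def regress_target_params(neighbors):
    """Eq. (2.16): weighted average of the domain-specific parameters
    of the neighbours; `neighbors` is a list of (weight, params)."""
    total = sum(w for w, _ in neighbors)
    return sum(w * p for w, p in neighbors) / total

# Hypothetical BN scale vectors of three known domains.
neighbors = [(0.8, np.array([1.0, 2.0])),
             (0.5, np.array([3.0, 4.0])),
             (0.1, np.array([9.0, 9.0]))]
theta_t = regress_target_params(neighbors)  # leans towards the closest domains
```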

We want to highlight that this strategy only requires extending the graph with a virtual node ṽ and computing the relative edges. While the relations of ṽ with other domains can be inferred from given metadata, as in (2.15), there could be cases in which no metadata is available for the target domain. In such situations, we can still exploit the incoming target image x to build a probability distribution over the nodes in V, in order to assign the new data point to a mixture of known domains. To this end, let us define p(v|x) as the conditional probability that an image x ∈ X, where X is the image space, is associated with a domain v ∈ V. From this probability distribution, we can infer the parameters of a classification model for x through:

$$\hat{\theta}_x = \sum_{v\in V} p(v\mid x)\,\psi(v) \tag{2.17}$$

where ψ(v) is well defined for each node linked to a known domain, while it must be estimated with (2.16) for each virtual domain ṽ ∈ V for which p(ṽ|x) > 0.

In practice, the probability p(v|x) is constructed from a metadata classifier μ, trained on the available data, that provides a probability distribution over M given x; this can be turned into a probability over D through the inverse mapping ϕ⁻¹.
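Given the distribution p(v|x) produced by such a classifier, Eq. (2.17) reduces to a convex combination of the per-node parameters (the metadata classifier itself is omitted here and the probabilities are hypothetical):

```python
import numpy as np

def params_from_probs(probs, node_params):
    """Eq. (2.17): mix per-node parameters with p(v|x)."""
    return sum(p * theta for p, theta in zip(probs, node_params))

probs = np.array([0.7, 0.2, 0.1])        # p(v|x) over three nodes
node_params = [np.array([1.0, 0.0]),
               np.array([0.0, 1.0]),
               np.array([1.0, 1.0])]
theta_x = params_from_probs(probs, node_params)
```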

Extracting node-specific models. We have described how to regress model parameters for an unknown domain by exploiting the domain graph. Now, we focus on the actual problem of training domain-specific models using the data available from the known domains K. Since K comprises a labeled source domain S and a set of auxiliary domains A, we cannot simply train independent models with data from each available domain, due to the lack of supervision on the domains in A for the target classification task. For this reason, we need to estimate the model parameters for the unlabeled domains A by exploiting DA techniques.

To achieve this, we start from the domain alignment layers presented in [141, 28, 29] and described in Section 2.3. In this scenario, the set of parameters for a domain k, ψ(k) = θk, is composed of different parts. Formally, for each domain we have ψ(k) = {θa, θks}, where θa holds the domain-agnostic components and θks the domain-specific ones. In our case θa comprises the parameters of the standard layers (i.e. the convolutional and fully connected layers of the architecture), while θks comprises the parameters and statistics of the domain-specific BN layers.



Figure 2.22. AdaGraph framework (best viewed in color). Each BN layer is replaced by its GBN counterpart. The parameters used in a GBN layer are computed from a given metadata and the graph. Each domain in the graph (circles) contains its specific parameters (rectangular blocks). During the training phase (blue part), a metadata (i.e. mz, blue block) is mapped to its domain (z). While the statistics of GBN are determined only by those of z (θz), scale and bias are computed considering also the graph edges. During test, we receive the metadata for the target domain (mṽ, red block), to which no node is linked. Thus we initialize ṽ and compute its parameters and statistics exploiting the connections with the other nodes in the graph (θṽ).


We start by using the labeled source domain S to estimate θa and initialize θSs . In particular,we obtain θS by minimizing the standard cross-entropy loss:

$$L(\theta_S) = -\frac{1}{|S|} \sum_{(x,y)\in S} \log f_{\theta_S}(y; x), \tag{2.18}$$

where fθS is the classification model for the source domain,with parameters θS .

To extract the domain-specific parameters θks for each k ∈ K, we employ two steps: the first is a selective forward pass for estimating the domain-specific statistics, while the second is the application of a loss to further refine the scale and bias parameters. Formally, we replace each BN layer in the network with a GraphBN counterpart (GBN), whose forward pass is defined as follows:

$$\mathrm{GBN}(x, v) = \gamma_v \, \frac{x - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}} + \beta_v. \tag{2.19}$$

where γv and βv are the node-specific scale and bias parameters of the BN layers. Basically, in a GBN layer, the set of BN parameters and statistics to apply is conditioned on the node/domain to which x belongs. While this equation is similar to Eq. (2.1), we highlight that, differently from it and from [29, 28], here we use domain-specific scale and bias parameters, not only statistics. During training, as for standard BN, we update the statistics by leveraging their estimate obtained from the current batch B:
$$\hat{\mu}_v = \frac{1}{|B_v|} \sum_{x\in B_v} x \qquad\text{and}\qquad \hat{\sigma}_v^2 = \frac{1}{|B_v|} \sum_{x\in B_v} (x - \hat{\mu}_v)^2, \tag{2.20}$$

where Bv is the set of elements in the batch belonging to domain v . As for the scale and bias parameters, we optimize them by means of a loss on the model output. For the auxiliary domains, since the data are unlabeled, we use an entropy loss, while a cross-entropy loss is used for the source domain:

$$L(\Theta^s) = -\frac{1}{|S|} \sum_{(x,y)\in S} \log f_{\theta_S}(y; x) \tag{2.21}$$

$$\qquad - \lambda \sum_{A_i\in A} \frac{1}{|A_i|} \sum_{x\in A_i} \sum_{y\in Y} f_{\theta_{A_i}}(y; x) \log f_{\theta_{A_i}}(y; x), \tag{2.22}$$

where Θs = {θks | k ∈ K} represents the whole set of domain-specific parameters and λ the trade-off between the cross-entropy and the entropy loss.
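The entropy term of Eq. (2.22) can be sketched in isolation as follows (a stand-alone version operating on already-softmaxed predictions; in the thesis it is applied to the network outputs on auxiliary-domain batches):

```python
import numpy as np

def entropy_loss(probs):
    """Average entropy of the predicted class distributions; minimizing
    it pushes the model towards confident predictions on unlabeled data."""
    eps = 1e-12  # numerical safety for log(0)
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

# Confident predictions give low entropy, uncertain ones high entropy.
confident = np.array([[0.98, 0.01, 0.01]])
uncertain = np.array([[1 / 3, 1 / 3, 1 / 3]])
assert entropy_loss(confident) < entropy_loss(uncertain)
```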

While (2.21) allows optimizing the domain-specific scale and bias parameters, it does not take into account the relationships between the domains imposed by the graph. A way to include the graph within the optimization procedure is to modify (2.19) as follows:

$$\mathrm{GBN}(x, v, G) = \gamma_v^G \, \frac{x - \mu_v}{\sqrt{\sigma_v^2 + \epsilon}} + \beta_v^G \tag{2.23}$$

with:

$$\nu_v^G = \frac{\sum_{k\in K} \omega(v,k)\,\nu_k}{\sum_{k\in K} \omega(v,k)}, \tag{2.24}$$

for ν ∈ {β, γ}. Basically, during the forward pass we use scale and bias parameters that are influenced by the graph edges, as described in (2.24).
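Eqs. (2.23)-(2.24) can be sketched as follows, with per-node statistics and graph-smoothed scale and bias (the dictionaries below are toy stand-ins for the graph and the layer state):

```python
import numpy as np

def gbn_forward(x, node, stats, params, weights, eps=1e-5):
    """Eqs. (2.23)-(2.24): normalize with the node's own statistics,
    but scale/shift with gamma and beta smoothed over the graph edges."""
    mu, var = stats[node]                  # statistics stay per-node
    total = sum(weights[node].values())
    gamma = sum(w * params[k][0] for k, w in weights[node].items()) / total
    beta = sum(w * params[k][1] for k, w in weights[node].items()) / total
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

stats = {'a': (0.0, 1.0)}                    # (mu, sigma^2) of node 'a'
params = {'a': (1.0, 0.0), 'b': (2.0, 1.0)}  # (gamma, beta) per node
weights = {'a': {'a': 1.0, 'b': 0.5}}        # omega(a, a), omega(a, b)
out = gbn_forward(np.array([0.0, 1.0]), 'a', stats, params, weights)
```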

Taking into account the presence of G during the forward pass is beneficial for two main reasons. First, it keeps the computation of these parameters at test time consistent with their use at training time. Second, it regularizes the optimization of γv and βv, which may be beneficial in cases where a domain contains few data. While the same procedure could also be applied to μv and σv, in our current design we avoid mixing them during training. This choice is linked to the fact that each image belongs to a single domain, and keeping the statistics separate allows us to estimate them more precisely.

At test time,once we have initialized the domain-specific parameters of T using either (2.16) or (2.17), the forward pass of each GBN layer is computed through (2.23). In Figure 2.22, we sketch the behaviour of AdaGraph both at training and test time.

2.7.3 Model Refinement through Joint Prediction and Adaptation


While the approach described in the previous section allows performing a blind adaptation of a model to a target domain given its metadata, it is not entirely true that we have no information about the images of the target domain. In fact, while at training time we have no access to target data, at test time target samples are gradually made available. While we could passively classify the target data stream, this would not be an effective choice, since the information coming directly from target images is valuable and can be leveraged to refine our model. This is extremely important, e.g. in the case of an inaccurate estimate of the target model parameters or in the presence of noisy metadata. In those cases, exploiting the stream of incoming images can compensate for the initial error.
To this end, we equip our model with a simple yet effective strategy to perform continuous domain adaptation. Following recent works [141] and our ONDA framework, we start from the observation that a simple way to continuously adapt a model to the incoming stream of target data is just to update the BN statistics. Formally, let us suppose our target domain is composed of a set of T observations T = {x1, …, xT}. Since we receive one data sample at a time, we provide our model with a memory. This memory has a fixed size M (e.g. M = 16 in all our experiments) and stores a sequence of M target samples. Once these samples have been collected, we use them to compute a local estimate of the GBN statistics for the target domain. This estimate is added to the global estimate of the statistics used by the GBN layers of our model, in the same way BN statistics are updated during training. After the update, we free the memory and restart collecting samples of the target domain. Obviously, the memory can be used not only to estimate the statistics for updating the GBN layers, but also as a starting point for more complex optimization strategies. In this work, we exploit the memory to further refine the regressed scale and bias parameters. In particular, we follow recent BN-based DA algorithms [28, 29] and employ an entropy loss on the target domain data collected in the memory. This loss is applied to the output normalized through the statistics computed using the samples within the memory, in order to ensure consistency with the training phase for the update of the statistics and parameters of each GBN layer.
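The memory-based refinement of the statistics can be sketched as follows (a simplified single-layer version; the decay value mirrors the training-time BN update and is a hypothetical choice):

```python
import numpy as np

class StatsMemory:
    """Fixed-size memory of target samples; every M samples the local
    statistics are blended into the global estimate, then the memory
    is freed and collection restarts."""
    def __init__(self, size=16, decay=0.1):
        self.size, self.decay, self.buf = size, decay, []

    def observe(self, x, mean, var):
        self.buf.append(x)
        if len(self.buf) == self.size:     # memory full: update stats
            batch = np.stack(self.buf)
            mean = (1 - self.decay) * mean + self.decay * batch.mean(axis=0)
            var = (1 - self.decay) * var + self.decay * batch.var(axis=0)
            self.buf = []                  # free the memory
        return mean, var

rng = np.random.default_rng(1)
mean, var = np.zeros(2), np.ones(2)        # predicted target statistics
mem = StatsMemory(size=16, decay=0.1)
for x in rng.normal(3.0, 1.0, size=(160, 2)):  # stream of target samples
    mean, var = mem.observe(x, mean, var)
```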

2.7.4 Experimental results


Experimental setting


Datasets. We analyze the performance of AdaGraph on three datasets: the Comprehensive Cars (CompCars) [292], the Century of Portraits [82] and the CarEvolution [218].
数据集。我们在三个数据集上分析AdaGraph的性能:综合汽车数据集(Comprehensive Cars,CompCars)[292]、百年肖像数据集(Century of Portraits)[82]和汽车进化数据集(CarEvolution)[218]。

The Comprehensive Cars (CompCars) [292] dataset is a large-scale database composed of 136,726 images spanning a time range between 2004 and 2015. As in [293], we use a subset of 24,151 images with 4 types of cars (MPV, SUV, sedan and hatchback) produced between 2009 and 2014 and taken from 5 different viewpoints (front, front-side, side, rear, rear-side). Considering each viewpoint and each manufacturing year as a separate domain, we have a total of 30 domains. As in [293], we use a PDA setting where 1 domain is considered as source, 1 as target and the remaining 28 as auxiliary sets, for a total of 870 experiments. In this scenario, the metadata are represented as vectors of two elements, one corresponding to the year and the other to the viewpoint, encoding the latter as in [293].
综合汽车数据集(CompCars)[292]是一个大规模数据库,由136,726张图像组成,时间跨度为2004年至2015年。与文献[293]一样,我们使用一个包含24,151张图像的子集,这些图像涉及4种类型的汽车(多用途汽车(MPV)、运动型多用途汽车(SUV)、轿车和掀背车),生产时间为2009年至2014年,拍摄视角有5种(正面、前侧面、侧面、背面、后侧面)。将每个视角和每个生产年份视为一个独立的领域,我们总共有30个领域。与文献[293]一样,我们采用部分领域自适应(Partial Domain Adaptation,PDA)设置,其中1个领域作为源领域,1个作为目标领域,其余28个作为辅助集,总共进行870次实验。在这种情况下,元数据表示为一个二维向量,一个元素对应年份,另一个对应视角,视角的编码方式与文献[293]相同。
Century of Portraits (Portraits) [82] is a large-scale collection of images taken from American high school yearbooks. The portraits span 108 years (1905–2013) across 26 states. We employ this dataset in a gender classification task, in two different settings. In the first setting, we test our PDA model in a leave-one-out scenario, with a protocol similar to the tests on the CompCars dataset. In particular, to define domains we consider spatio-temporal information and cluster images according to decades and spatial regions (we use 6 USA regions, as defined in [82]). Filtering out the sets with fewer than 150 images, we obtain 40 domains, corresponding to 8 decades (from 1934 on) and 5 regions (New England, Mid Atlantic, Mid West, Pacific, Southern). We follow the same experimental protocol as in the CompCars experiments, i.e. we use one domain as source, one as target and the remaining 38 as auxiliaries. We encode the domain metadata as a vector of 3 elements, denoting the decade, the latitude (0 or 1, indicating north/south) and the east-west location (from 0 to 3), respectively. Additional details can be found in the appendix. In the second scenario, we use this dataset to assess the performance of our continuous refinement strategy. In this case, we employ all the portraits before 1950 as source samples and those after 1950 as target data.
百年肖像数据集(Portraits)[82]是一个大规模的图像集,图像来自美国高中年鉴。这些肖像拍摄时间跨度为108年(1905 - 2013年),涵盖26个州。我们在性别分类任务中以两种不同的设置使用这个数据集。在第一种设置中,我们在留一法(leave-one-out)场景下测试我们的PDA模型,测试协议与综合汽车数据集的测试类似。具体来说,为了定义领域,我们考虑时空信息,并根据十年时间段和空间区域对图像进行聚类(我们使用文献[82]中定义的6个美国区域)。过滤掉图像数量少于150张的集合后,我们得到40个领域,对应8个十年时间段(从1934年开始)和5个区域(新英格兰、大西洋中部、中西部、太平洋沿岸、南部)。我们遵循与综合汽车数据集实验相同的实验协议,即使用一个领域作为源领域,一个作为目标领域,其余38个作为辅助领域。我们将领域元数据编码为一个三维向量,分别表示十年时间段、纬度(0或1,表示北部/南部)和东西位置(从0到3)。更多详细信息可在附录中找到。在第二种场景中,我们使用这个数据集评估我们的连续细化策略的性能。在这种情况下,我们将1950年之前的所有肖像作为源样本,1950年之后的作为目标数据。
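The 3-element metadata encoding described above can be sketched as follows. The helper name is ours and the exact decade/region indexing is detailed only in the thesis appendix, so this is an illustrative reconstruction of the stated scheme (decade index, north/south latitude bit, east-west position from 0 to 3).

```python
def encode_domain(decade_start, latitude_ns, east_west):
    """Encode a Portraits domain as a 3-element metadata vector.

    decade_start: first year of the decade (8 decades, from 1934 on)
    latitude_ns:  0 = north, 1 = south
    east_west:    integer position from 0 (one coast) to 3 (the other)
    """
    decade_index = (decade_start - 1934) // 10  # 0 .. 7
    return [decade_index, latitude_ns, east_west]
```

For instance, a hypothetical northern domain in the first decade would map to `[0, 0, 3]`, while a southern one in the last decade would map to `[7, 1, 0]`.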

CarEvolution [218] is composed of car images collected between 1972 and 2013. It contains 1008 images of cars produced by three different manufacturers, with two car models each, following the evolution of the production of those models over the years. We chose this dataset to assess the effectiveness of our continuous domain adaptation strategy. A similar evaluation has been employed in recent works considering online DA [139]. As in [139], we consider the task of manufacturer prediction with three categories: Mercedes, BMW and Volkswagen. Images of cars produced before 1980 are used as the source set and the remaining ones as target samples.
CarEvolution数据集[292]由1972年至2013年间收集的汽车图像组成。它包含三个不同制造商生产的汽车的1008张图像,每个制造商有两种车型,展示了这些车型多年来的生产演变。我们选择这个数据集是为了评估我们的连续域自适应策略的有效性。最近考虑在线域自适应(Online DA)的研究[139]也采用了类似的评估方法。与文献[139]一样,我们考虑制造商预测任务,该任务有三个类别:梅赛德斯(Mercedes)、宝马(BMW)和大众(Volkswagen)。1980年之前的汽车图像被视为源数据集,其余的用作目标样本。

Networks and Training Protocols. To analyze the impact of our main contributions on performance, we consider the ResNet-18 architecture [98] and perform experiments on the Portraits dataset. In particular, we apply our model by replacing each BN layer with its AdaGraph counterpart. We start from the network pre-trained on ImageNet, training it for 1 epoch on the source dataset, employing Adam as optimizer with a weight decay of $10^{-6}$ and a batch size of 16. We choose a learning rate of $10^{-3}$ for the classifier and $10^{-4}$ for the rest of the architecture. We then train the network for 1 epoch on the union of source and auxiliary domains to extract the domain-specific parameters. We keep the same optimizer and hyperparameters, except for the learning rates, which are decayed by a factor of 10. The batch size is kept at 16, but each batch is composed of elements of a single year-region pair belonging to one of the available domains (either auxiliary or source). The order of the pairs is randomly sampled within the set of allowed ones.
网络与训练方案。为了分析我们的主要贡献对性能的影响,我们考虑使用ResNet - 18架构[98],并在肖像(Portraits)数据集上进行实验。具体而言,我们通过将每个批量归一化(BN)层替换为其对应的AdaGraph层来应用我们的模型。我们从在ImageNet上预训练的网络开始,在源数据集上训练1个轮次,使用Adam作为优化器,权重衰减为106,批量大小为16。我们为分类器选择的学习率为103,为架构的其余部分选择的学习率为104。我们在源域和辅助域的并集上训练网络1个轮次,以提取特定于域的参数。除了学习率衰减为原来的十分之一外,我们保持相同的优化器和超参数。批量大小保持为16,但每个批量由属于可用域(辅助域或源域)之一的单个年份 - 区域对的元素组成。这些对的顺序在允许的集合中随机采样。

In order to fairly compare with previous methods, we also consider Decaf features [59]. In particular, in the experiments on the CompCars dataset, we use Decaf features extracted at the fc7 layer. Note that these features are comparable to the ones used in [293] (i.e. the penultimate layer of the VGG-F model in [35]). Similarly, for the experiments on CarEvolution, we follow [139] and use Decaf features extracted at the fc6 layer. In both cases, we apply our model by adding either a BN layer or our AdaGraph approach directly on top of the features, followed by a ReLU activation and a linear classifier. For these experiments we train the model on the source domain for 10 epochs using Adam as optimizer with a learning rate of $10^{-3}$, a batch size of 16 and a weight decay of $10^{-6}$. The learning rate is decayed by a factor of 10 after 7 epochs. For CompCars, when training with the auxiliary set, we use the same optimizer, batch size and weight decay, with a learning rate of $10^{-4}$ for 1 epoch. Domain-specific batches are randomly sampled, as in the experiments on Portraits.
为了与以前的方法进行公平比较,我们还考虑了Decaf特征[59]。具体而言,在CompCars数据集的实验中,我们使用在fc7层提取的Decaf特征。请注意,这些特征与文献[293]中使用的特征(即文献[35]中VGG - F模型的倒数第二层)具有可比性。类似地,在CarEvolution数据集的实验中,我们遵循文献[139],使用在fc6层提取的Decaf特征。在这两种情况下,我们通过直接在特征上添加批量归一化(BN)层或我们的AdaGraph方法,然后进行ReLU激活和线性分类器来应用我们的模型。对于这些实验,我们使用Adam作为优化器,在源域上训练模型10个轮次,学习率为103,批量大小为16,权重衰减为106。学习率在7个轮次后衰减为原来的十分之一。对于CompCars数据集,在使用辅助集进行训练时,我们使用相同的优化器、批量大小和权重衰减,学习率为104,训练1个轮次。与肖像数据集的实验一样,特定于域的批量是随机采样的。
For all the experiments we use as distance measure $d(x,y) = \frac{1}{2\sigma}\|x-y\|_2^2$ with $\sigma = 0.1$, and set $\lambda$ equal to 1.0, both in the training and in the refinement stage. At test time, we classify each input image as it arrives, performing the refinement step after the classification. The buffer size in the refinement phase is equal to 16 and we set $\alpha = 0.1$, the same value used for updating the GBN components while training with the auxiliary domains.
在所有实验中,我们在训练和细化阶段都使用d(x,y)=12σxy22作为距离度量,其中σ=0.1,并将λ设置为1.0。在测试时,我们对每个输入图像进行实时分类,并在分类后执行细化步骤。细化阶段的缓冲区大小为16,我们设置α=0.1,这与在辅助域上训练时更新全局批量归一化(GBN)组件所使用的值相同。
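A minimal sketch of the distance measure above, together with an exponentially decaying edge weight. The exponential form of the weighting is our own illustrative assumption here (the exact definition of the graph edges is given in Section 2.7.2):

```python
import math

def metadata_distance(x, y, sigma=0.1):
    """d(x, y) = ||x - y||_2^2 / (2 * sigma), the distance measure used
    in all experiments, with sigma = 0.1."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma)

def edge_weight(x, y, sigma=0.1):
    # Illustrative only: we assume the edge strength decays
    # exponentially with the metadata distance, so close domains get
    # weights near 1 and distant domains weights near 0.
    return math.exp(-metadata_distance(x, y, sigma))
```

With `sigma = 0.1`, two metadata vectors one unit apart yield a distance of 5.0, so the corresponding edge is already heavily down-weighted relative to near-identical domains.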

We implemented our method 14 with the PyTorch [202] framework and performed our evaluation on an NVIDIA GeForce GTX 1080 Ti GPU.
我们使用PyTorch框架[202]实现了我们的方法14,并使用NVIDIA GeForce 1080 Ti GTX GPU进行评估。

Results
结果


In this section we report the results of our evaluation, showing both an empirical analysis of the proposed contributions and a comparison with state-of-the-art approaches.
在本节中,我们报告评估结果,展示对所提出贡献的实证分析以及与最先进方法的比较。

Analysis of AdaGraph. We first analyze the performance of our approach by employing the Portraits dataset. In particular, we evaluate the impact of (i) introducing a graph to predict the target domain BN statistics (AdaGraph BN), (ii) adding scale and bias parameters trained in isolation (AdaGraph SB) or jointly (AdaGraph Full) and (iii) adopting the proposed refinement strategy (AdaGraph + Refinement). As baseline 15 we consider the model trained only on the source domain and, as an upper bound, a corresponding DA method which is allowed to use target data during training. In our case, the upper bound corresponds to a model similar to the method proposed in [28].
AdaGraph分析。我们首先通过使用肖像(Portraits)数据集来分析我们方法的性能。具体而言,我们评估以下方面的影响:(i)引入图来预测目标域的批量归一化(BN)统计信息(AdaGraph BN);(ii)添加单独训练(AdaGraph SB)或联合训练(AdaGraph Full)的尺度和偏置参数;(iii)采用所提出的细化策略(AdaGraph + 细化)。作为基线15,我们考虑仅在源域上训练的模型,作为上限,我们考虑一种允许在训练期间使用目标数据的相应域自适应(DA)方法。在我们的案例中,上限对应于类似于文献[28]中提出的方法的模型。

The results of our ablation are reported in Table 2.17, where we report the average classification accuracy in two scenarios: across decades (same region for source and target domains) and across regions (same decade for source and target datasets). The first scenario corresponds to 280 experiments, the second to 160. As shown in the table, simply replacing the statistics of the BN layers of the source model with those predicted through our graph achieves a large boost in accuracy (+4% in the across-decades scenario and +2.4% in the across-regions one). At the same time, estimating the scale and bias parameters without considering the graph is suboptimal. In fact, there is a misalignment between the forward pass of the training phase (i.e. considering only domain-specific parameters) and how these parameters are combined at test time (i.e. considering also the connections with the other nodes of the graph). Interestingly, in the across-regions setting, our full model slightly drops in performance with respect to predicting only the BN statistics. This is probably due to how regions are encoded in the metadata (i.e. by geographical location), making it difficult to capture factors (e.g. cultural, historical) which can be more discriminative in characterizing the population of a region or a state. However, as stated in Section 2.7.3, employing a continuous refinement strategy allows the method to compensate for prediction errors. As shown in Table 2.17, with a refinement step (AdaGraph + Refinement) the accuracy consistently increases, filling the gap between the performance of the initial model and our DA upper bound.
我们的消融实验结果报告在表2.17中,我们报告了对应两种场景的平均分类准确率:跨年代(源域和目标域考虑相同区域)和跨区域(源数据集和目标数据集考虑相同年代)。第一种场景对应280次实验,而第二种对应160次测试。如表所示,通过简单地用我们的图预测的统计数据替换源模型的批量归一化(Batch Normalization,BN)层的统计数据,在跨年代场景中准确率大幅提升(+4%,在跨区域场景中提升了2.4%)。同时,不考虑图来估计尺度和偏差参数是次优的。实际上,训练阶段的前向传播(即仅考虑特定领域的参数)与这些参数在测试时的组合方式(即还考虑与图中其他节点的连接)之间存在不一致。有趣的是,在跨区域设置中,与仅预测BN统计数据相比,我们的完整模型性能略有下降。这可能是由于区域在元数据中的编码方式(即考虑地理位置),使得难以捕捉对表征一个地区或一个州的人口更具区分性的因素(例如文化、历史因素)。然而,如第2.7.3节所述,采用连续细化策略可以使该方法弥补预测误差。如表2.17所示,通过一个细化步骤(AdaGraph + 细化),准确率持续提高,缩小了初始模型性能与我们的域适应(Domain Adaptation,DA)上限之间的差距。


14 The code is available at https://github.com/mancinimassimiliano/adagraph
14 代码可在https://github.com/mancinimassimiliano/adagraph获取

15 We do not report the results of previous approaches [293] since the code is not publicly available.
15 由于之前方法的代码未公开,我们未报告其结果[293]。





Figure 2.23. Portraits dataset: comparison of different models in the PDA scenario with respect to the average accuracy on a target decade, fixed the same region of source and target domains. The models are based on ResNet-18.
图2.23. 肖像数据集:在部分域适应(Partial Domain Adaptation,PDA)场景中,不同模型在目标年代的平均准确率比较,源域和目标域固定为相同区域。这些模型基于ResNet - 18。

Table 2.17. Portraits dataset. Ablation study.
表2.17. 肖像数据集。消融研究。

Method | Across Decades | Across Regions
Baseline | 82.3 | 89.2
AdaGraph BN | 86.3 | 91.6
AdaGraph SB | 86.0 | 90.5
AdaGraph Full | 87.0 | 91.0
Baseline + Refinement | 86.2 | 91.3
AdaGraph + Refinement | 88.6 | 91.9
DA upper bound | 89.1 | 92.1
方法跨数十年跨地区
基线82.389.2
自适应图批量归一化(AdaGraph BN)86.391.6
自适应图子批量归一化(AdaGraph SB)86.090.5
自适应图全量归一化(AdaGraph Full)87.091.0
基线 + 细化86.291.3
自适应图 + 细化88.691.9
领域自适应上限(DA upper bound)89.192.1


It is worth noting that applying the refinement procedure to the source model (Baseline + Refinement) leads to better performance (about +4% in the across-decades scenario and +2.1% in the across-regions one). More importantly, the performance of Baseline + Refinement is always worse than that obtained by AdaGraph + Refinement, since our model provides, on average, a better starting point for the refinement procedure.
值得注意的是,对源模型(基线模型 + 细化处理)应用细化程序可带来更好的性能(在跨年代场景中约为+4%,在跨区域场景中提高 2.1%)。更重要的是,基线模型 + 细化处理方法的性能始终不如 AdaGraph + 细化处理方法,因为我们的模型平均而言为细化程序提供了更好的起点。

Figure 2.23 shows the results for the across-decades scenario. Each bar plot corresponds to experiments where the target domain is associated with a specific year. As shown in the figure, on average, our full model outperforms both AdaGraph BN and AdaGraph SB, showing the benefit of the proposed graph strategy. The results in the figure also clearly show that our refinement strategy always leads to a boost in performance.
图 2.23 展示了跨年代场景的相关结果。每个柱状图对应目标领域与特定年份相关的实验。如图所示,平均而言,我们的完整模型优于 AdaGraph BNAdaGraphSB,显示了所提出的图策略的优势。图中的结果也清楚地表明,我们的细化策略总能提升性能。

Comparison with the state of the art. Here we compare the performance of our model with state-of-the-art PDA approaches. We use the CompCars dataset and benchmark against the Multivariate Regression (MRG) methods proposed in [293]. We apply our model in the same setting as [293] and perform 870 different experiments, computing the average accuracy (Table 2.18). Our model outperforms
与现有技术的比较。在此,我们将我们模型的性能与最先进的部分领域适应(PDA)方法进行比较。我们使用 CompCars 数据集,并与文献 [293] 中提出的多元回归(MRG)方法进行基准测试。我们在与文献 [293] 相同的设置下应用我们的模型,并进行 870 次不同的实验,计算平均准确率(表 2.18)。我们的模型表现更优

Table 2.18. CompCars dataset [292]. Comparison with the state of the art. Markers denote the input features: Decaf or VGG-Full.
表 2.18. CompCars 数据集 [292]。与现有技术的比较。 表示以 Decaf 特征作为输入, 表示 VGG - 全量特征。

Method | Avg. Accuracy
Baseline [293] | 54.0
Baseline + BN | 56.1
MRG-Direct [293] | 58.1
MRG-Indirect [293] | 58.2
AdaGraph (metadata) | 60.1
AdaGraph (images) | 60.8
Baseline + Refinement | 59.5
AdaGraph + Refinement | 60.9
DA upper bound | 60.9
方法平均准确率
基线 [293]54.0
基线 +BN56.1
MRG直接法 [293]58.1
MRG间接法 [293]58.2
自适应图(元数据) 60.1
自适应图(图像) 60.8
基线 + 细化 59.5
自适应图 + 细化 60.9
域适应上限 60.9


the two methods proposed in [293], improving the performance of the Baseline network by 4%. AdaGraph alone outperforms the Baseline model even when the latter is updated with our refinement strategy and target data (Baseline + Refinement). When coupled with a refinement strategy, our graph-based model further improves the performance, filling the gap between AdaGraph and our DA upper bound. It is interesting to note that our model is also effective when no metadata are available in the target domain. In the table, AdaGraph (images) corresponds to our approach when, instead of initializing the BN layers for the target by exploiting metadata, we employ the current input image and a domain classifier to obtain a probability distribution over the graph nodes, as described in Section 2.7.2. The results in the table show that AdaGraph (images) is more accurate than AdaGraph (metadata).
[293]中提出的两种方法使基线网络(Baseline network)的性能提高了4%。当使用我们的细化策略和目标数据(基线 + 细化)进行更新时,仅AdaGraph就优于基线模型(Baseline model)。当与细化策略结合使用时,我们基于图的模型进一步提高了性能,缩小了AdaGraph与我们的域适应(DA)上限之间的差距。值得注意的是,当目标领域没有可用的元数据时,我们的模型同样有效。在表中,AdaGraph(图像)对应于我们的方法,即不利用元数据为目标初始化批量归一化(BN)层,而是使用当前输入图像和域分类器来获得图节点上的概率分布,如第2.7.2节所述。表中的结果表明,AdaGraph(图像)比AdaGraph(元数据)更准确。

Exploiting AdaGraph Refinement for Continuous Domain Adaptation. In Section 2.7.3, we have shown how to boost the performance of our model by leveraging the stream of incoming target data to refine the estimates of the target BN statistics and parameters. Throughout the experimental section, we have also demonstrated how this strategy improves the target classification model, with performance close to DA methods which exploit target data during training.
利用AdaGraph细化进行连续域适应。在第2.7.3节中,我们展示了一种通过利用传入的目标数据流来提高我们模型性能的方法,并细化了目标批量归一化(BN)统计量和参数的估计。在整个实验部分,我们还展示了这种策略如何改进目标分类模型,其性能接近在训练期间利用目标数据的域适应(DA)方法。

In this section we show how this approach can be employed as a competitive method for continuous domain adaptation [103]. We consider the CarEvolution dataset and compare the performance of our proposed strategy with two state-of-the-art algorithms: the manifold-based adaptation method in [103] and the low-rank SVM strategy presented in [139]. As in [139] and [103], we apply our adaptation strategy after classifying each novel image and compute the overall accuracy. The images of the target domain are presented to the network in chronological order, i.e. from 1980 to 2013. The results are shown in Table 2.19. While the integration of a BN layer alone leads to better performance over the baseline, our refinement strategy produces an additional boost of about 3%. If scale and bias parameters are refined considering the entropy loss, accuracy further increases.
在本节中,我们展示了在连续域适应[103]的情况下,这种方法如何作为一种有竞争力的方法使用。我们考虑了汽车进化(CarEvolution)数据集,并将我们提出的策略的性能与两种最先进的算法进行了比较:[103]中基于流形的适应方法和[139]中提出的低秩支持向量机(SVM)策略。与[139]和[103]一样,我们在对每个新图像进行分类后应用我们的适应策略,并计算总体准确率。目标领域的图像按时间顺序(即从1980年到2013年)呈现给网络。结果如表2.19所示。虽然仅集成批量归一化(BN)层就比基线有更好的性能,但我们的细化策略会额外提升约3%。如果考虑熵损失来细化尺度和偏置参数,准确率会进一步提高。
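The evaluation protocol above, in which each target image is classified before the model adapts on it and overall accuracy is computed over the chronologically ordered stream, can be sketched as follows. Here `classify` and `adapt` are placeholders for the model's prediction and refinement steps.

```python
def online_evaluation(stream, classify, adapt):
    """Online continuous-DA protocol sketch: predict on each sample as
    it arrives (in chronological order), then refine the model on that
    sample, and report the accuracy over the whole stream."""
    correct = 0
    for x, y in stream:                  # stream of (image, label) pairs
        correct += int(classify(x) == y)  # classify first...
        adapt(x)                          # ...then adapt on the sample
    return correct / len(stream)

# e.g. online_evaluation(target_stream, model.predict, model.refine)
```

The key point is the ordering: each prediction is made with the model state produced by all previous samples, so the reported accuracy reflects adaptation as it actually unfolds over time.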

We also test the proposed model on a similar task on the Portraits dataset. The results of our experiments are shown in Table 2.20. Similarly to what we observed in the previous experiments, continuously adapting our deep model as target data become available leads to better performance with respect to the baseline. The refinement of scale and bias parameters contributes a further boost in accuracy.
我们还在考虑肖像(Portraits)数据集的类似任务上测试了所提出的模型。我们的实验结果如表2.20所示。与之前的实验观察结果类似,随着目标数据的可用,持续调整我们的深度模型相对于基线会带来更好的性能。尺度和偏置参数的细化有助于进一步提高准确率。

Table 2.19. CarEvolution [218]: comparison with state of the art.
表2.19. 汽车进化(CarEvolution)[218]:与最先进方法的比较。

Method | Accuracy
Baseline SVM [139] | 39.7
Baseline + BN | 43.7
CMA+GFK [103] | 43.0
CMA+SA [103] | 42.7
LLRESVM [139] | 43.6
LLRESVM+EDA [139] | 44.3
ONDA (Baseline + Refinement Stats) [166] | 46.5
Baseline + Refinement Full | 47.3
方法准确率
基线支持向量机(Baseline SVM) [139]39.7
基线 + 批量归一化(Baseline + BN)43.7
协方差匹配自适应+全局核适配(CMA+GFK) [103]43.0
协方差匹配自适应+子空间对齐(CMA+SA) [103]42.7
局部线性表示支持向量机(LLRESVM) [139]43.6
局部线性表示支持向量机+增强数据扩充(LLRESVM+EDA)[139]44.3
在线无监督域自适应(ONDA,Baseline+Refinement Stats) [166]46.5
基线 + 完全细化(Baseline + Refinement Full)47.3

Table 2.20. Portraits dataset [82]: performance of the refinement strategy in the continuous adaptation scenario.
表2.20. 肖像数据集 [292]:细化策略在连续自适应场景下的性能

Method | Baseline | Refinement Stats [166] | Refinement Full
Accuracy | 81.9 | 87.3 | 88.1
方法基线细化统计[166]完全细化
准确率81.987.388.1


2.7.5 Conclusions
2.7.5 结论


We presented the first deep architecture for Predictive Domain Adaptation, AdaGraph. We leverage metadata information to build a graph where each node represents a domain, while the strength of an edge models the similarity between two domains according to their metadata. We then propose to exploit the graph for the purpose of DA and design novel domain-alignment layers. This framework yields the new state of the art on standard PDA benchmarks. We further present an approach that exploits the stream of incoming target data to refine the target model. We show that this strategy is itself an effective method for continuous DA, outperforming state-of-the-art approaches as well as our previous ONDA model. In future works, it would be interesting to explore methodologies to incrementally update the graph and to automatically infer relations among domains, even in the absence of metadata. Moreover, the connections among the nodes could be used in few-shot scenarios, exploiting the relations among domains to provide additional feedback to nodes of domains with few samples.
我们提出了首个用于预测性领域自适应(Predictive Domain Adaptation)的深度架构,即Ada - 图(Ada - Graph)。我们利用元数据信息构建一个图,其中每个节点代表一个领域,而边的强度根据两个领域的元数据对它们之间的相似性进行建模。然后,我们提议利用该图进行领域自适应(DA),并设计了新颖的领域对齐层。该框架在标准的预测性领域自适应(PDA)基准测试中取得了新的最优结果。我们进一步提出了一种利用传入的目标数据来细化目标模型的方法。我们表明,这种策略本身也是一种有效的连续领域自适应(DA)方法,优于现有最优方法以及我们之前的ONDA模型。在未来的工作中,探索逐步更新图并自动推断领域之间关系的方法将是很有趣的,即使在没有元数据的情况下也是如此。此外,节点之间的连接可以在小样本场景中使用,利用领域之间的关系为样本较少的领域节点提供额外的反馈。

This section concludes our works that considered the domain shift problem in isolation, both in the presence and in the absence of target data and under different settings. In the next chapters, we will describe our works tackling the semantic shift problem, first in isolation (Chapter 3) and then coupled with the domain shift problem (Chapter 4).
本节总结了我们在有和没有目标数据以及不同设置下单独考虑领域偏移问题的工作。在接下来的章节中,我们将描述我们的工作,首先单独处理语义偏移问题(第3章),然后将其与领域偏移问题结合处理(第4章)。

Chapter 3 Recognizing New Semantic Concepts
第3章 识别新的语义概念


This chapter analyzes different problems concerning the extension of a pre-trained architecture to new visual concepts in an incremental fashion, varying the knowledge we want to add and what we want to recognize. As in the previous chapter, we start by providing a general formulation of the problem (Sec. 3.1) and reviewing previous works on incremental learning of classes/tasks and in an open world (Sec. 3.2). In Sec. 3.3 we show how we can extend a model to perform the same task (i.e. classification) across multiple visual domains with different output spaces (e.g. digit recognition, street sign classification) through affinely transformed binary masks [172]. This approach extends previous works on multi-domain learning [160], achieving (at the time of acceptance) the best trade-off between learning new tasks effectively and using a low number of parameters. In Sec. 3.4 we focus on the incremental class learning problem, where new classes are added to the same classification head as old ones, in the context of semantic segmentation [31]. Here we show that there is an inherent problem in this setting, caused by the semantic shift of the background class across different incremental steps. We show how this problem can be addressed by a simple modification of the cross-entropy and distillation losses employed in previous approaches [144]. Finally, in Sec. 3.5, we analyze the problem of open-world recognition, where the goal is not only to include new classes incrementally but also to detect whether an image belongs to an unknown category. We analyze the problem in robotics scenarios, starting by implementing the first deep approach for this problem [167]. The approach extends standard non-parametric methods [15] and is further improved in a subsequent work by clustering-based losses and class-specific rejection options [69].
In [167] we also discuss how the approach could be employed in a realistic scenario by obtaining datasets with new knowledge directly from the web, a first step towards having agents able to automatically expand their visual recognition capabilities by reasoning on what they see in the real world.
本章分析了关于以增量方式将预训练架构扩展到新的视觉概念的不同问题,这些问题会随着我们想要添加的知识和想要识别的内容而变化。与上一章一样,我们首先对问题进行一般性表述(3.1节),并回顾之前关于类别/任务增量学习以及开放世界中的增量学习的工作(3.2节)。在3.3节中,我们展示了如何通过仿射变换的二进制掩码[172],使模型在具有不同输出空间的多个视觉领域(例如数字识别、街道信号分类)中执行相同的任务(即分类)。这种方法扩展了之前关于多领域学习的工作[160],在有效学习新任务和使用较少参数之间取得了(在接受时)最高的平衡。在3.4节中,我们关注在语义分割[31]的背景下,当新类别被添加到旧类别的同一分类头时的增量类别学习问题。在这里,我们展示了这种设置中存在一个固有的问题,即背景类别在不同增量步骤中的语义偏移所导致的问题。我们展示了如何通过简单修改先前方法中使用的交叉熵和蒸馏损失[144]来解决这个问题。最后,在3.5节中,我们分析开放世界识别问题,其目标不仅是逐步包含新类别,还要检测图像是否属于未知类别。我们在机器人场景中分析这个问题,首先实现了针对该问题的首个深度方法[167]。该方法扩展了标准的非参数方法[15],并在后续工作中通过基于聚类的损失和特定类别的拒绝选项[69]得到了进一步改进。在[167]中,我们还讨论了如何通过直接从网络获取包含新知识的数据集,将该方法应用于现实场景,这是使智能体能够通过对现实世界中所见内容进行推理来自动扩展其视觉识别能力的第一步。

3.1 Problem statement
3.1 问题陈述


Overview. In Chapter 2 we analyzed multiple algorithms able to overcome the domain shift problem in various scenarios. However, while the domain shift is a crucial issue for the applicability of visual systems in real scenarios, it addresses only one side of the problem: changes in the input distribution without changes in the semantic space. In this chapter, we are interested in tackling the opposite problem. Given a model trained to recognize a set of classes in a given domain, we want to extend its output space, equipping it with the ability to recognize semantic concepts not included in the initial training set.
概述。在第2章中,我们分析了多种能够在各种场景中克服领域偏移问题的算法。然而,虽然领域偏移是视觉系统在现实场景中应用的一个关键问题,但它只涉及问题的一个方面:输入分布的变化而语义空间不变。在本章中,我们感兴趣的是处理相反的问题。给定一个在给定领域中训练用于识别一组类别的模型,我们希望扩展其输出空间,使其具备识别初始训练集中未包含的语义概念的能力。

The methodologies used to add new knowledge to a pre-trained model can be roughly divided into three main categories, depending on the information we have about the training classes. In the first category, we receive data for the novel concepts we want our model to recognize. This scenario is usually called continual/incremental learning [49] and requires adding new knowledge to the model without access to the initial training set and, more importantly, without forgetting previous knowledge [175, 114, 89]. In the second category, we have models using few sample images of the classes of interest at test time, exploiting the initial training set to learn how to compare this support set, composed of few images, with a query image. These models fall into the few-shot learning paradigm and require receiving, at test time, sample images of the classes we want to recognize [246, 68]. The last category of methods learns to recognize concepts beyond the initial training set without any image available, using class descriptions instead (e.g. binary attributes [130], word embeddings [180]). In this scenario, called zero-shot learning [278], a model has to map images into a given semantic embedding space where all classes (seen and unseen) are projected. In this way, it is possible to compare images with unseen and/or seen concepts to perform the final classification.
根据我们对训练类别所掌握的信息,向预训练模型添加新知识的方法大致可分为三大类。在第一类中,我们会获取想要模型识别的新类别概念的数据。这种情况通常被称为持续/增量学习 [49],它要求在不访问初始训练集的情况下向模型添加新知识,更重要的是,不能遗忘先前的知识 [175,114,89]。在第二类中,我们的模型在测试时使用少量感兴趣类别的样本图像,并利用初始训练集来学习如何将由少量图像组成的支持集与查询图像进行比较。这些模型属于少样本学习范式,需要在测试时接收我们想要识别的类别的样本图像 [246,68]。最后一类方法无需任何可用图像,但使用类别描述(例如二元属性 [130]、词嵌入 [180])来学习识别初始训练集之外的概念。在这种被称为零样本学习 [278] 的情况下,模型必须将图像映射到一个给定的语义嵌入空间中,所有类别(已见和未见)都被投影到该空间。通过这种方式,就可以将图像与未见和/或已见概念进行比较,以进行最终分类。

In this thesis, we consider both incremental and zero-shot learning models. In particular, in this chapter we consider scenarios where the domain shift is not present, i.e. training and test domains are the same, but the semantic space of the model is incrementally extended over time, as in the first category. In Chapter 4, we will show how to obtain a model addressing both domain and semantic shift, recognizing unseen categories (as in zero-shot learning) in unseen domains (as in domain generalization).
在本论文中,我们将同时考虑增量学习和零样本学习模型。具体而言,在本章中,我们将考虑不存在领域偏移的场景,即训练和测试领域相同,但模型的语义空间会随着时间逐步扩展,就像第一类情况那样。在第 4 章中,我们将展示如何获得一个能够同时应对领域偏移和语义偏移的模型,该模型可以在未见领域(如领域泛化)中识别未见类别(如零样本学习)。

Incremental Learning. Let us formalize the incremental learning problem. Assume we have a model pre-trained on a set $\mathcal{T}^0 = \{(x_i^0, y_i^0)\}_{i=1}^{n_0}$, with $x_i^0 \in \mathcal{X}$ and $y_i^0 \in \mathcal{C}^0$. Note that $\mathcal{X}$ is the input space, as in Section 2.1, while $\mathcal{C}^0$ is the output space of the initial training set (i.e. the set of classes in $\mathcal{T}^0$). Using this set we can obtain a function $f_\theta^0 : \mathcal{X} \rightarrow \mathcal{C}^0$, parametrized by $\theta$, mapping images into the initial output space. To include new concepts in $f_\theta^0$, we receive a new dataset containing the new concepts of interest. Since we might perform multiple training steps, let us denote with $\mathcal{T}^t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$ the dataset we receive at time $t$. Note that, while the input space does not change (i.e. $x_i^t \in \mathcal{X}$), the output space does, and we have $y_i^t \in \mathcal{C}^t$ with $\mathcal{C}^i \cap \mathcal{C}^j = \emptyset$ if $i \neq j$. After $T$ training steps, our goal is to obtain a model $f_\theta^T : \mathcal{X} \rightarrow \mathcal{Y}^T$, where the output space $\mathcal{Y}^T$ comprises all the concepts seen up to training step $T$, i.e. $\mathcal{Y}^T = \bigcup_{t=0}^{T} \mathcal{C}^t$.
增量学习。让我们对增量学习问题进行形式化。假设我们有一个在集合 T0={(xi0,yi0)}i=1n0 上预训练的模型,其中 xi0Xyi0C0。请注意,X 是输入空间,如第 2.1 节所述,而 C0 是初始训练集的输出空间(即 T0 中的类别集合)。使用这个集合,我们可以得到一个由 θ 参数化的函数 f0θ:XC0,它将图像映射到初始输出空间。为了在 f0θ 中纳入新的概念,我们会收到一个包含感兴趣的新概念的新数据集。由于我们可能会执行多个训练步骤,让我们用 Tt={(xit,yit)}i=1nt 表示在时间 t 收到的数据集。请注意,虽然输入空间不变(即 xitX),但输出空间会发生变化,并且如果 ij,则有 yitCtCiCj=。经过 T 个训练学习步骤后,我们的目标是获得一个模型 fTθ:XYT,其中输出空间 YT 包含直到训练步骤 T 所见过的所有概念,即 YT=tCtTt=0
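The formalization above can be illustrated with a toy example: each incremental step brings a class set disjoint from all previous ones, and the final output space is their union. The class names below are hypothetical.

```python
def output_space(class_sets):
    """Given the per-step class sets [C^0, C^1, ..., C^T], verify the
    disjointness assumption (C^i ∩ C^j = ∅ for i ≠ j) and return the
    final output space Y^T = ∪_t C^t."""
    for i in range(len(class_sets)):
        for j in range(i + 1, len(class_sets)):
            assert class_sets[i].isdisjoint(class_sets[j])
    return set().union(*class_sets)

# three hypothetical learning steps
Y = output_space([{"cat", "dog"}, {"car", "truck"}, {"tree"}])
```

After the three steps, the model is expected to classify over all five concepts, even though each step only provided labels for its own subset.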
Under this definition, we face different problems depending on how the output space is built [36]. The first distinction concerns the number of classification heads. We have single-head models, where there is a single classification head for all the concepts in $\mathcal{Y}^T$, and multi-head models, where we have one head per set of classes $\mathcal{C}^t$. In this latter scenario, despite some exceptions [5, 213], it is common to give as input to the prediction function the information about the output space of interest, i.e. $f_\theta^T : \mathcal{X} \times \mathcal{Z} \rightarrow \mathcal{Y}^T$ with $\mathcal{Z} = \{0, \ldots, T\}$. We will analyze this scenario in the context of Multi-Domain Learning [214, 19], in Section 3.3.
根据这一定义,根据输出空间的构建方式,我们会遇到不同的问题 [36]。第一个区别在于分类头的数量。我们有单头模型,即对于 YT 中的所有概念只有一个分类头;还有多头模型,即对于每组类别 Ct 都有一个分类头。在后一种情况下,尽管有一些例外 [5,213],通常会将感兴趣的输出空间信息,即 fTθ:X×ZYTZ={0,,T},作为输入提供给预测函数。我们将在第 3.3 节的多领域学习 [214, 19] 背景下分析这种情况。
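A minimal sketch of the multi-head prediction function $f_\theta^T : \mathcal{X} \times \mathcal{Z} \rightarrow \mathcal{Y}^T$, where the task index $z$ selects the classification head. The heads below are hypothetical stand-ins for trained classifiers.

```python
def multi_head_predict(heads, x, z):
    """Multi-head prediction: the task index z in Z = {0, ..., T}
    selects which head maps the input x into its own label space."""
    return heads[z](x)

heads = [
    lambda x: "digit-" + str(x % 10),  # head for task 0 (e.g. digits)
    lambda x: "sign-" + str(x % 4),    # head for task 1 (e.g. street signs)
]
```

The same input produces a label from a different output space depending on which head is queried, which is exactly why the task index must be supplied at test time in the multi-head setting.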

Considering the single-head scenario, the second distinction relates to the limits of $\mathcal{Y}^T$. If $\mathcal{Y}^T$ is closed-ended, we have the standard incremental class learning scenario, and we ask our model to recognize to which class in $\mathcal{Y}^T$ an image belongs. We will analyze this setting in Section 3.4, for the task of semantic segmentation. If $\mathcal{Y}^T$ is open-ended, i.e. the model includes a rejection option for samples of unknown classes, we are in the open world recognition scenario, and our model is asked to recognize the class of an image and, possibly, to detect whether it belongs to an unknown concept. This scenario will be the focus of Section 3.5.
考虑单头情况,第二个区别与 YT 的范围有关。如果 YT 是封闭的,我们就处于标准的增量类学习场景中,我们要求模型识别我们的图像属于 YT 中的哪个类别。我们将在第 3.4 节的语义分割任务中分析这种设置。如果 YT 是开放的,即模型包含对已知类别的拒绝选项,我们就处于开放世界识别场景中,我们要求模型识别图像的类别,并最终检测它是否属于未知概念。这种场景将是第 3.5 节的重点。

In the following section, we will report the relevant literature for incremental learning, multi-domain learning and open world recognition.
在接下来的部分,我们将介绍增量学习、多领域学习和开放世界识别的相关文献。

3.2 Related Works
3.2 相关工作


Incremental Learning. The problem of catastrophic forgetting [175] has been extensively studied for image classification tasks [49]. Previous works can be grouped into three categories [49]: replay-based [216, 30, 239, 106, 272, 195], regularization-based [118, 36, 300, 144, 56], and parameter isolation-based [161, 160, 227]. In replay-based methods, examples of previous tasks are either stored [216, 30, 106, 275] or generated [239, 272, 195] and then replayed while learning the new task. Parameter isolation-based methods [161, 160, 227] assign a subset of the parameters to each task to prevent forgetting. Regularization-based methods can be divided into prior-focused and data-focused. The former [300, 36, 118, 3] define knowledge as the parameter values, constraining the learning of new tasks by penalizing changes to parameters important for old ones. The latter [144, 56] exploit distillation [102] and use the distance between the activations produced by the old network and the new one as a regularization term to prevent catastrophic forgetting.
增量学习。灾难性遗忘问题 [175] 在图像分类任务中得到了广泛研究 [49]。以往的工作可以分为三类 [49]:基于重放的方法 [216, 30, 239, 106, 272, 195]、基于正则化的方法 [118,36,300,144,56] 和基于参数隔离的方法 [161,160,227]。在基于重放的方法中,先前任务的示例要么被存储 [216,30,106,275] 要么被生成 [239,272,195],然后在学习新任务时进行重放。基于参数隔离的方法 [161,160,227] 为每个任务分配一部分参数以防止遗忘。基于正则化的方法可以分为侧重于先验的方法和侧重于数据的方法。前者 [300,36,118,3] 将知识定义为参数值,通过对旧任务重要参数的变化进行惩罚来约束新任务的学习。后者 [144,56] 利用蒸馏技术 [102],并使用旧网络和新网络产生的激活值之间的距离作为正则化项来防止灾难性遗忘。

Despite this progress, very few works have gone beyond image-level classification. A first work in this direction is [240], which considers ICL in object detection, proposing a distillation-based method adapted from [144] to tackle novel class recognition and bounding box proposal generation. In this work we take a similar approach to [240] and also resort to distillation. However, here we propose to address the problem of modeling the background shift, which is peculiar to the semantic segmentation setting.

To our knowledge, the problem of ICL in semantic segmentation has been addressed only in [196, 197, 256, 178]. Ozdemir et al. [196, 197] describe an ICL approach for medical imaging, extending a standard image-level classification method [144] to segmentation and devising a strategy to select relevant samples of old datasets for rehearsal. Taras et al. [256] propose a similar approach for segmenting remote sensing data. Differently, Michieli et al. [178] consider ICL for semantic segmentation in a particular setting where labels are provided for old classes while learning new ones. Moreover, they assume that the novel classes are never present as background in pixels of previous learning steps. These assumptions strongly limit the applicability of their method.

Here we propose a more principled formulation of the ICL problem in semantic segmentation. In contrast with previous works, we do not limit our analysis to medical [196] or remote sensing data [256], and we do not impose any restrictions on how the label space should change across different learning steps [178]. Moreover, we are the first to provide a comprehensive experimental evaluation of state-of-the-art ICL methods on commonly used semantic segmentation benchmarks and to explicitly introduce and tackle the semantic shift of the background class, a problem recognized but largely overlooked by previous works [178].

Multi-domain Learning. Another challenge in incremental learning is extending a pre-trained model to address new tasks, each with a different output space. Indeed, the need for visual models capable of addressing multiple domains has received a lot of attention in recent years, concerning both multi-task learning [298, 150, 32] and multi-domain learning [214, 223]. Multi-task learning focuses on learning multiple visual tasks (e.g. semantic segmentation, depth estimation [150]) with a single architecture. On the other hand, the goal of multi-domain learning is building a model able to address a task (e.g. classification) in multiple visual domains (e.g. real photos, digits) without forgetting previous domains and by using as few parameters as possible. An important work in this context is [19], where the authors showed how multi-domain learning can be addressed by using a network sharing all parameters except for batch-normalization (BN) layers [109]. In [214], the authors introduced the Visual Domain Decathlon Challenge, a first multi-domain learning benchmark. The first attempts at addressing this challenge involved domain-specific residual components added to standard residual blocks, either in series [214] or in parallel [215]. In [223] the authors propose to use controller modules where the parameters of the base architecture are recombined channel-wise, while [150] exploits domain-specific attention modules. Other effective approaches include devising instance-specific fine-tuning strategies [96], target-specific architectures [184] and learning covariance normalization layers [140].
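As an illustration of the BN-sharing idea of [19], the sketch below keeps one set of batch-normalization statistics and affine parameters per domain, while all other (convolutional) weights would be shared. This is a hypothetical plain-Python rendering over 1-D feature vectors, not the actual implementation; class and field names are ours.

```python
class DomainSpecificBN:
    """Per-domain normalization statistics (mean, var) and affine
    parameters (gamma, beta), as in the BN-sharing scheme of [19]."""
    def __init__(self, num_features, domains, eps=1e-5):
        self.eps = eps
        self.params = {d: {"mean": [0.0] * num_features,
                           "var": [1.0] * num_features,
                           "gamma": [1.0] * num_features,
                           "beta": [0.0] * num_features}
                       for d in domains}

    def forward(self, x, domain):
        """Normalize a feature vector x with the statistics and affine
        parameters of the requested domain."""
        p = self.params[domain]
        return [g * (v - m) / (var + self.eps) ** 0.5 + b
                for v, m, var, g, b in zip(x, p["mean"], p["var"],
                                           p["gamma"], p["beta"])]
```

Only the few BN parameters grow with the number of domains, which is what makes this scheme so parameter-efficient.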
In [161], only a reserved subset of network parameters is considered for each domain. The intersection of the parameters used by different domains is empty, so the network can be trained end-to-end for each domain. Obviously, as the number of domains increases, fewer parameters are available for each domain, with a consequent limitation on the performance of the network. To overcome this issue, in [160] the authors proposed a more compact and effective solution based on directly learning domain-specific binary masks. The binary masks determine which of the network parameters are useful for the new domain and which are not, changing the actual composition of the features extracted by the network. This approach inspired subsequent works, improving either the expressiveness of the binary masks [171] or the number of bits required, by directly masking entire channels [17].
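The multiplicative masking of [160] can be sketched in a few lines: a real-valued mask is hard-thresholded into a binary one, which then selects which backbone parameters survive for the new domain while the backbone itself stays frozen. The helper names and the list-of-lists weight layout are illustrative assumptions.

```python
def binary_mask(real_mask, threshold=0.0):
    """Hard-threshold a real-valued mask: m = 1 if r >= threshold else 0."""
    return [[1.0 if r >= threshold else 0.0 for r in row]
            for row in real_mask]

def apply_multiplicative_mask(W, M):
    """Piggyback-style [160] domain-specific weights: each parameter is
    either kept (m = 1) or zeroed out (m = 0); W is never updated."""
    return [[w * m for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]
```

The per-domain storage cost is then roughly one bit per backbone parameter (the binary mask).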

In our work [172], we take inspiration from these last research trends. In particular, we generalize the design of the binary masks employed in [160] and [171], considering neither simple multiplicative binary masks [160] nor simple affine transformations of the original weights [171], but a general and flexible formulation capturing both cases. Experiments show that our approach in [172] leads to a boost in performance while using a comparable number of parameters per domain. Moreover, our approach achieves performance comparable to more complex models [215, 184, 140] in the challenging Visual Domain Decathlon Challenge, largely reducing the gap between binary-mask based methods and the current state of the art. Note that while learning to address the same task (i.e. classification) in multiple visual domains, this line of work addresses catastrophic forgetting by adding domain-specific parameters, extending the semantic extent of a pre-trained model by exploiting isolated sets of parameters. In fact, if the initial network parameters remain untouched, the catastrophic forgetting problem is avoided, but at the cost of the additional parameters required. The extreme case is the work of [227] in the context of reinforcement learning, where a parallel network is added each time a new domain is presented, with side connections to previous domains exploited to improve the performance on novel domains. Differently from [227], the mask-based approaches [161, 160, 171, 172] require a much lower overhead in terms of total parameters, showing comparable or even superior results to task-specific fine-tuned models [161, 160, 172].

Open World Recognition. The necessity of breaking the closed-world assumption (CWA) for robot vision systems [254] has led to various research efforts on understanding how to extend pre-trained models with new semantic concepts while retaining previous knowledge and detecting possibly unknown ones. There are two components towards this goal: the first is incrementally adding new categories to the pre-trained model, while the second is maintaining an accurate estimate of the uncertainty of the predictions, allowing the model to reject inputs of unseen classes. Due to the central role this task has in real-world applications, recent years have seen a growing interest among robotic vision researchers in topics such as continual [132] and incremental learning [263, 25, 32]. In [201], the authors study how to update the visual recognition system of a humanoid robot over multiple training sessions. In [25], a variant of the Regularized Least Squares algorithm is introduced to add new classes to a pre-trained model. In [200], a growing dual-memory is proposed to dynamically learn novel object instances and categories. In [126], the authors proposed to learn an embedding in order to perform fast incremental learning of new objects. Another solution to this problem can exploit human-robot interaction, as in [263], where a robot incrementally learns to detect new objects as they are manually pointed out by a human.
While these approaches focus on incremental and continual learning, acting in the open world requires both detecting unknown concepts automatically and adding them in subsequent learning stages. Towards this objective, in [15] the authors introduced the OWR setting as a more general and realistic scenario for agents acting in the real world. In [15], the authors extend the Nearest Class Mean (NCM) classifier [177, 95] to act in the open set scenario, proposing the Nearest Non-Outlier (NNO) algorithm. In order to estimate whether a test sample belongs to the known or unknown set of categories, this method introduces a rejection threshold that, after the first initialization phase, is kept fixed for subsequent learning episodes. In [50], the authors proposed to tackle OWR with the Nearest Ball Classifier, with a rejection threshold based on the confidence of the predictions. In [167], we extended the NNO algorithm of [15] by employing an end-to-end trainable deep architecture as feature extractor, with a dynamic update strategy for the rejection threshold. Moreover, our work was the first to consider collecting datasets containing new knowledge from web resources, towards agents able to automatically include new knowledge with little to no human supervision. In the subsequent work [69], we showed how to improve the performance of NCM-based classifiers for OWR through a global-to-local clustering loss. Moreover, differently from previous works, we designed class-specific rejection thresholds that are explicitly learned rather than fixed based on heuristic strategies.
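A minimal sketch of the NCM-with-rejection idea underlying NNO [15]: class means are computed from known-class features, and a test sample farther than a threshold tau from every mean is rejected as unknown. The plain-Python helpers below are illustrative assumptions; [15, 167] additionally update the threshold and the feature representation over learning episodes.

```python
def class_means(features, labels):
    """Mean feature vector per known class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def ncm_with_rejection(x, means, tau):
    """Nearest Class Mean prediction with a rejection threshold tau:
    if the closest mean is farther than tau, flag the sample unknown."""
    best, best_d = None, float("inf")
    for y, mu in means.items():
        d = sum((a - b) ** 2 for a, b in zip(x, mu)) ** 0.5
        if d < best_d:
            best, best_d = y, d
    return best if best_d <= tau else "unknown"
```

Samples flagged "unknown" can then be collected and labeled (or harvested from the web, as in [167]) to open a new incremental learning episode.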

3.3 Sequential and Memory-Efficient Learning of New Datasets 1,2


In this section, we focus on the problem of multi-domain learning [214, 19]. Following the problem statement of [214], the goal of multi-domain learning is to train a model to address multiple classification tasks using as few parameters as possible for each of them. In the following, we focus on the case, also considered in [214], where we adapt an initial pre-trained model to address novel tasks sequentially. This capability is crucial for increasing the knowledge of an intelligent system and for developing effective incremental [222, 125] and life-long [258, 259, 243] learning algorithms. While fascinating, achieving this goal requires facing multiple challenges. First, learning a new task should not negatively affect the performance on old tasks. Second, adding many parameters to the model for each new task should be avoided, as it would lead to poor scalability of the framework. In this context, while deep learning algorithms have achieved impressive results on many computer vision benchmarks [124, 98, 83, 152], mainstream approaches for adapting deep models to novel tasks tend to suffer from the problems mentioned above. In fact, fine-tuning a given architecture on new data does produce a powerful model on the novel task, but at the expense of degraded performance on the old ones, resulting in the well-known phenomenon of catastrophic forgetting [71, 89]. At the same time, replicating the network parameters and training a separate network for each task is a powerful approach that preserves performance on old tasks, but at the cost of an explosion of the network parameters [214].

Different works addressed these problems either by considering losses encouraging the preservation of the current weights [144, 118] or by designing task-specific network parameters [227, 214, 223, 161, 160]. Interestingly, in [160] the authors showed that an effective strategy for achieving good sequential multi-task learning performance with a minimal increase in network size is to create a binary mask for each task. In particular, this mask is then multiplied by the main network weights, determining which of them are useful for addressing the new task.

In this section, we take inspiration from these last works, and we formulate sequential multi-task learning as the problem of learning a perturbation of a baseline, pre-trained network, in a way that maximizes the performance on a new task. Importantly, the perturbation should be compact, in the sense of limiting the number of additional parameters required with respect to the baseline network. To this end, we apply an affine transformation to each convolutional weight of the baseline network, which involves both a learned binary mask and a few additional parameters. The binary mask is used as a scaled and shifted additive component and as a multiplicative filter on the original weights. Figure 3.1 shows an example application of our proposed algorithm. Given a network pre-trained on a particular task (i.e. ImageNet [225], orange blocks), we can transform its original weights through binary masks (colored grids) and obtain a network which effectively addresses a novel task (e.g. digit [189] or traffic sign [250] recognition). We name our solution BAT (Binary-mask Affinely Transformed for multi-domain learning). This solution allows us to achieve two main goals: 1) boosting the performance of each task-specific network that we train, by leveraging the higher degree of freedom in perturbing the baseline network, while 2) keeping a low per-task overhead in terms of additional parameters (slightly more than 1 bit per parameter per task).


1 M. Mancini, E. Ricci, B. Caputo, S. Rota Bulò. Adding New Tasks to a Single Network with Weight Transformations using Binary Masks. European Conference on Computer Vision Workshop on Transferring and Adapting Source Knowledge in Computer Vision, 2018.

2 M. Mancini, E. Ricci, B. Caputo, S. Rota Bulò. Boosting Binary Masks for Multi-Domain Learning through Affine Transformations. Machine Vision and Applications, 2020.





Figure 3.1. Idea behind our BAT approach. A network pre-trained on a given recognition task A (i.e. ImageNet) can be extended to tackle other recognition tasks B (e.g. digits) and C (e.g. traffic sign) by simply transforming the network weights (orange cubes) through task-specific binary masks (colored grids).


We assess the validity of BAT, and some variants thereof, on standard benchmarks including the Visual Decathlon Challenge [214]. The experimental results show that our model achieves performance comparable with fine-tuning separate networks for each recognition task on all benchmarks, while retaining a very small overhead in terms of additional parameters per task. Notably, we achieve results comparable to state-of-the-art models on the Visual Decathlon Challenge [214], but without requiring multiple training stages [140] or a large number of task-specific parameters [96, 215].

3.3.1 Problem Formulation


We address the problem of sequential learning of new tasks, i.e. we modify a baseline network (e.g. ResNet-50 pre-trained on the ImageNet classification task) so as to maximize its performance on a new task, while limiting the amount of additional parameters needed. The solution we propose exploits the key idea from Piggyback [160] of learning task-specific masks, but instead of pursuing the simple multiplicative transformation of the parameters of the baseline network, we define a parametrized affine transformation mixing a binary mask and real parameters that significantly increases the expressiveness of the approach, leading to a rich and nuanced ability to adapt the old parameters to the needs of the new tasks. This in turn brings considerable improvements in all the conducted experiments, as we will show in the experimental section, while retaining a reduced per-task overhead.



Figure 3.2. Overview of the proposed BAT model (best viewed in color). Given a convolutional kernel, we exploit a real-valued mask to generate a domain-specific binary mask. An affine transformation is applied directly to the binary mask, changing its range (through a scale parameter k2) and its minimum value (through k1). The multiplicative mask applied to the original kernels and the pre-trained kernels themselves are scaled by the factors k3 and k0, respectively. All the different components are summed to produce the final domain-specific kernel.


Let us assume we are given a pre-trained baseline network f0(⋅; Θ, Ω0): X → Y0 assigning a class label in Y0 to elements of an input space X (e.g. images).3 The parameters of the baseline network are partitioned into two sets: Θ comprises parameters that will be shared across other domains, whereas Ω0 contains the rest of the parameters (e.g. the classifier). Our goal is to learn, for each domain i ∈ {1, …, m} with a possibly different output space Yi, a classifier fi(⋅; Θ, Ωi): X → Yi. Here, Ωi contains the parameters specific to the i-th domain, while Θ holds the shareable parameters of the baseline network mentioned above.

Each domain-specific network fi shares the same structure as the baseline network f0, except for a possibly differently sized classification layer. For each convolutional layer4 of f0 with parameters W, the domain-specific network fi holds a binary mask M, with the same shape as W, that is used to mask the original filters. The way the mask is exploited to specialize the network filters produces different variants of our model, which we describe in the following.


3 We focus on classification tasks,but the proposed method applies also to other tasks.

4 Fully-connected layers are a special case.


3.3.2 Affine Weight Transformation through Binary Masks


Following previous works [160], we consider domain-specific networks fi shaped like the baseline network f0, and we store in Ωi a binary mask M for each convolutional kernel W in the shared set Θ. However, differently from [160], we consider a more general affine transformation of the base convolutional kernel W that depends on a binary mask M as well as additional parameters. Specifically, we transform W into

(3.1) W~ = k0·W + k1·1 + k2·M + k3·(W ∘ M),

where kj ∈ R are additional domain-specific parameters in Ωi that we learn along with the binary mask M, 1 is an appropriately sized tensor of ones, and ∘ is the Hadamard (or element-wise) product. The transformed parameters W~ are then used in the convolutional layer of fi. We highlight that the domain-specific parameters stored in Ωi amount to just a single bit per parameter in each convolutional layer, plus a few scalars per layer, yielding a low overhead per additional domain while retaining a sufficient degree of freedom to build new convolutional weights. Figure 3.2 provides an overview of the transformation in (3.1).
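Since the transformation in (3.1) is elementwise, it can be sketched directly; the snippet below also shows how the special cases discussed next ([160] and Eq. (3.2)) fall out of particular choices of k0, …, k3. Plain Python over nested lists, for illustration only.

```python
def bat_transform(W, M, k0, k1, k2, k3):
    """Eq. (3.1): W~ = k0*W + k1*1 + k2*M + k3*(W o M), elementwise.
    Setting k0 = k1 = k2 = 0, k3 = 1 recovers Piggyback [160];
    setting k3 = 0 gives the simplified variant of Eq. (3.2) [171]."""
    return [[k0 * w + k1 + k2 * m + k3 * w * m
             for w, m in zip(wr, mr)] for wr, mr in zip(W, M)]
```

In practice only the binary mask M (1 bit per weight) and the four scalars kj are stored per domain and per layer.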

Our model can be regarded as a parametrized generalization of [160], since we can recover the formulation of [160] by setting k0 = k1 = k2 = 0 and k3 = 1. Similarly, if we get rid of the multiplicative component, i.e. we set k3 = 0, we obtain the following transformation

(3.2) Wˇ = k0·W + k1·1 + k2·M,

which corresponds to a simpler but still effective version of our method (presented in [171]) and will be taken into account in our analysis.

We want to highlight that each model (i.e. [160], BAT, and its simplified version) has different representation capabilities. In fact, in [160] the domain-specific parameters can take only two possible values: either 0 (i.e. if m = 0) or the original pre-trained weights (i.e. if m = 1). On the other hand, the scalar components of our simple model [171] allow both scaling (with k0) and shifting (with k1) the original network weights, with the additive binary mask selectively adding a bias term (i.e. k2) to a group of parameters (those with m = 1). BAT generalizes [160] and [171] by considering the multiplicative binary-mask term W ∘ M as an additional bias component scaled by the scalar k3. In this way, our model can obtain parameter-specific bias components, something that was possible neither in [160] nor in [171]. The additional degrees of freedom make the search space of our method larger than that of [160, 171], with the possibility of expressing more complex (and tailored) domain-specific transformations. Thus, as we show in the experimental section, the additional parameters that we introduce with our method bring a negligible per-domain overhead compared to [160] and [171], which is nevertheless generously balanced out by a significant boost in the performance of the domain-specific classifiers.

Finally, following [19], we also opt for domain-specific batch-normalization parameters (i.e. mean, variance, scale and bias), unless otherwise stated. These parameters are not fixed (i.e. they do not belong to Θ) but are part of Ωi, and thus optimized for each domain. In the cases where a convolutional layer is followed by batch normalization, we keep the corresponding parameter k0 fixed to 1, because the output of batch normalization is invariant to the scale of the convolutional weights.

3.3.3 Learning Binary Masks


Given the training set of the i-th domain, we learn the domain-specific parameters Ωi by minimizing a standard supervised loss, i.e. the classification log-loss. However, while the domain-specific batch-normalization parameters can be learned by employing standard stochastic optimization methods, the same is not feasible for the binary masks. Indeed, optimizing the binary masks directly would turn the learning into a combinatorial problem. To address this issue, we follow the solution adopted in [160], i.e. we replace each binary mask M with a thresholded real matrix R. By doing so, we shift from optimizing discrete variables in M to continuous ones in R. However, the gradient of the hard threshold function h(r) = 1_{r ≥ 0} is zero almost everywhere, making this solution apparently incompatible with gradient-based optimization approaches. To address this issue, we consider a strictly increasing surrogate function h~ that will be used in place of h only for the gradient computation, i.e.

h'(r) ≈ h~'(r),

where h' denotes the derivative of h with respect to its argument. The gradient that we obtain via the surrogate function has the property that it always points in the right downhill direction on the error surface. Let r be a single entry of R, with m = h(r), and let E(m) be the error function. Then

sgn((E ∘ h)'(r)) = sgn(E'(m)h'(r)) ≈ sgn(E'(m)h~'(r))

and, since h~'(r) > 0 by construction of h~, we obtain the sign agreement

sgn(E'(m)h~'(r)) = sgn(E'(m)).

Accordingly, when the gradient of E(h(r)) with respect to r is positive (negative), this induces a decrease (increase) of r. By the monotonicity of h, this eventually induces a decrease (increase) of m, which is compatible with the direction pointed to by the gradient of E with respect to m.

In the experiments, we set h~(x) = x, i.e. the identity function, recovering the workaround suggested in [100] and employed also in [160]. However, other choices are possible. For instance, by taking h~(x) = (1 + e^{-x})^{-1}, i.e. the sigmoid function, we obtain a better approximation, as suggested in [90, 16]. We test different choices for h~(x) in the experimental section.
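The forward/backward asymmetry above can be sketched as follows: the forward pass uses the hard threshold h, while the backward pass substitutes the surrogate derivative h~'(r) > 0, so the sign of the gradient reaching a real-valued mask entry always agrees with sgn(E'(m)). Both surrogate choices mentioned in the text (identity [100, 160] and sigmoid [90, 16]) are included; function names are ours.

```python
import math

def hard_threshold(r):
    """Forward pass: m = h(r) = 1 if r >= 0 else 0."""
    return 1.0 if r >= 0 else 0.0

def surrogate_grad(r, surrogate="identity"):
    """Backward pass: replace h'(r) (zero almost everywhere) with
    h~'(r) > 0. 'identity' gives h~'(r) = 1; 'sigmoid' gives the
    derivative of (1 + exp(-r))^-1, i.e. s(r) * (1 - s(r))."""
    if surrogate == "identity":
        return 1.0
    s = 1.0 / (1.0 + math.exp(-r))
    return s * (1.0 - s)

def mask_grad(dE_dm, r, surrogate="identity"):
    """Gradient passed to the real-valued mask entry r: dE/dm * h~'(r).
    Since h~'(r) > 0, its sign always agrees with sgn(dE/dm)."""
    return dE_dm * surrogate_grad(r, surrogate)
```

In a framework this is the classic straight-through estimator pattern: a custom autograd function whose forward is h and whose backward multiplies the incoming gradient by h~'(r).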

3.3.4 Experimental results


Datasets. In the following we test our method on two different multi-task benchmarks, where the multiple tasks regard different classification objectives and/or domains. For the first benchmark we follow [160], and we use 6 datasets: ImageNet [225], VGG-Flowers [191], Stanford Cars [122], Caltech-UCSD Birds (CUBS) [269], Sketches [63] and WikiArt [231]. VGG-Flowers [191] is a fine-grained recognition dataset containing images of 102 categories, corresponding to different kinds of flowers. There are 2'040 images for training and 6'149 for testing. Stanford Cars [122] contains images of 196 different types of cars, with approximately 8 thousand images for training and 8 thousand for testing. Caltech-UCSD Birds [269] is another fine-grained recognition dataset containing images of 200 different species of birds, with approximately 6 thousand images for training and 6 thousand for testing. Sketches [63] is a dataset composed of 20 thousand sketch drawings, 16 thousand for training and 4 thousand for testing. It contains images of 250 different objects in their sketched representations. WikiArt [231] contains paintings from 195 different artists. The dataset has 42'129 images for training and 10'628 images for testing. These datasets exhibit large variations both in the categories addressed (e.g. cars [122] vs. birds [269]) and in the appearance of their instances (from natural images [225] to paintings [231] and sketches [63]), thus representing a challenging benchmark for sequential multi-task learning techniques.
The second benchmark is the Visual Decathlon Challenge [214]. This challenge was introduced in order to assess the capability of a single algorithm to tackle 10 different classification tasks. The tasks are taken from the following datasets: ImageNet [225], CIFAR-100 [123], Aircraft [159], Daimler pedestrian classification (DP) [187], Describable Textures (DTD) [44], German Traffic Signs (GTS) [250], Omniglot [128], SVHN [189], UCF101 Dynamic Images [18, 248] and VGG-Flowers [191]. A more detailed description of the challenge and the datasets can be found in [214]. For this challenge, an independent scoring function is defined [214]. This function S is expressed as:
第二个基准测试是视觉十项全能挑战赛(Visual Decathlon Challenge)[214]。引入这项挑战赛是为了检验单一算法处理10种不同分类任务的能力。这些任务取自以下数据集:ImageNet[225]、CIFAR - 100[123]、飞机数据集(Aircraft)[159]、戴姆勒行人分类数据集(Daimler pedestrian classification,DP)[187]、可描述纹理数据集(Describable textures,DTD)[44]、德国交通标志数据集(German traffic signs,GTS)[250]、Omniglot[128]、街景门牌号数据集(SVHN)[189]、UCF101动态图像数据集[18, 248]和VGG花卉数据集(VGG - Flowers)[191]。关于该挑战赛和数据集的更详细描述可在文献[214]中找到。针对这项挑战赛,定义了一个独立的评分函数[214]。该函数S表示为:

S = \sum_{d=1}^{10} \alpha_d \, \max\{0, E_d^{\max} - E_d\}^2        (3.3)

where E_d^max is the test error of the baseline in domain d, E_d is the test error of the submitted model, and α_d is a scaling parameter ensuring that the perfect score for each task is 1,000, for a maximum score of 10,000 over the whole challenge. The baseline error is computed by doubling the error of 10 independent models fine-tuned on the single tasks. This scoring function takes into account the performance of a model on all 10 tasks, preferring models with good performance on all of them over models that outperform the baseline by a large margin on just a few. Following [17], we use this metric also for the first benchmark, keeping the same upper bound of 1,000 points per task. Moreover, as in [17], we report the ratio between the score obtained and the parameters used, denoting it as S_p. This metric captures the trade-off between performance and model size.
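As a sanity check, the scoring rule of Eq. (3.3) can be sketched in a few lines of Python. The exponent is fixed to 2 and α_d is calibrated so that a perfect model earns 1,000 points per task; the dictionary-based interface is our own illustration, not the official evaluation code:

```python
def decathlon_score(err, err_max, gamma=2.0):
    """Visual Decathlon score of Eq. (3.3).

    err     -- dict: task -> test error E_d of the submitted model
    err_max -- dict: task -> baseline error E_d^max (twice the error of a
               model fine-tuned on that single task)
    """
    score = 0.0
    for d, e_max in err_max.items():
        # alpha_d is set so that a perfect model (E_d = 0) scores 1000
        alpha = 1000.0 / (e_max ** gamma)
        score += alpha * max(0.0, e_max - err[d]) ** gamma
    return score

# A perfect task earns 1000 points; matching (or exceeding) the baseline
# error earns 0, since the max(0, .) clamps negative differences.
perfect = decathlon_score({"svhn": 0.0}, {"svhn": 0.25})  # 1000.0
```

Note how the quadratic exponent rewards models that are uniformly better than the baseline more than models that excel on a single task.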

Networks and training protocols. For the first benchmark, we use 3 networks: ResNet-50 [98], DenseNet-121 [107] and VGG-16 [245], reporting the results of Piggyback [160], PackNet [161] and both the simple [171] and full versions of our model (BAT).

Following the protocol of [160], for all the models we start from networks pre-trained on ImageNet and train the task-specific networks using Adam [116] as optimizer, except for the classifiers, where SGD [21] with momentum is used. The networks are trained with a batch size of 32 and an initial learning rate of 0.0001 for Adam and 0.001 for SGD with momentum 0.9. Both learning rates are decayed by a factor of 10 after 15 epochs. In this scenario we use input images of size 224×224 pixels, with the same data augmentation (i.e. mirroring and random rescaling) as [161, 160]. The real-valued masks are initialized with random values drawn from a uniform distribution between 0.0001 and 0.0002. Since our model is independent of the order of the tasks, we do not consider different possible orderings and report results as accuracy averaged across multiple runs. For simplicity, in the following we denote this scenario as ImageNet-to-Sketch.
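For concreteness, the mask initialization described above can be sketched as follows (a minimal NumPy sketch; the function name and seed are our own):

```python
import numpy as np

def init_real_mask(shape, low=1e-4, high=2e-4, seed=0):
    """Initialize a real-valued mask with values drawn uniformly
    between 0.0001 and 0.0002, as in the protocol above; the mask is
    later thresholded into a binary one during the forward pass."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=shape)

mask = init_real_mask((64, 3, 7, 7))  # e.g. a conv-kernel-shaped mask
```

Starting all entries in a narrow positive band means every mask begins in the same state, so the initial behavior of the task-specific network is close to that of the pre-trained backbone.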
For the Visual Decathlon we employ the Wide ResNet-28 [297] adopted by previous methods [214, 223, 160], with a widening factor of 4 (i.e. 64, 128 and 256 channels in each residual block). Following [214] we rescale the input images to 72×72 pixels and feed the network crops of 64×64 pixels. We follow the protocol of [160], training the simple and full versions of our model for 60 epochs per task with a batch size of 32, again using Adam for the entire architecture except the classifier, where SGD with momentum is used. The same learning rates as in the first benchmark are adopted and are decayed by a factor of 10 after 45 epochs. The same initialization scheme is used for the real-valued masks. No hyperparameter tuning was performed, as we used a single training schedule for all 10 tasks, except for the ImageNet pre-trained model, which was trained following the schedule of [214]. As for data augmentation, mirroring was applied, except for the datasets of digits (SVHN), symbols (Omniglot, GTS) and textures (DTD), where it would be harmful (the first two cases) or unnecessary.

In both benchmarks, we train our network on one task at a time, sequentially for all tasks. For each task we introduce the task-specific binary masks and additional scalar parameters described in section 3.3.2. Moreover, following previous approaches [214, 215, 160, 223], we use a separate classification layer for each task. As in the comparison systems [214, 215, 160, 223], these separate classification layers are not counted in the parameter overhead of our model.

Results


ImageNet-to-Sketch. In the following we discuss the results obtained by our model in the ImageNet-to-Sketch scenario. We compare our method with Piggyback [160], PackNet [161] and two baselines: the network used only as a feature extractor, training just the task-specific classifier, and individual networks fine-tuned separately on each task. PackNet [161] adds a new task to an architecture by identifying the weights important for that task, optimizing the architecture through alternating pruning and re-training steps. Since this algorithm depends on the order of the tasks, we report its performance for two different orderings [160]: starting from the model pre-trained on ImageNet, in the first setting (↓) the order is CUBS-Cars-Flowers-WikiArt-Sketch, while in the second (↑) the order is reversed. For our model, we evaluate both the full and the simple versions, including task-specific batch-normalization layers. Since including batch-normalization layers affects the performance, for the sake of a fair comparison we also report the results of Piggyback [160] obtained as a special case of our model with separate BN parameters per task for ResNet-50 and DenseNet-121. Moreover, we report the results of the Budget-Aware adapters (BA2) method of [17]. This method applies binary masks not per parameter but per channel, with a budget constraint that further squeezes the network complexity. As in our method, task-specific BN layers are used in [17].
Results are reported in Tables 3.1, 3.2 and 3.3. Both versions of our model fill the gap between the classifier-only baseline and the individually fine-tuned architectures almost entirely in all settings. For larger and more diverse datasets such as Sketch and WikiArt, the gap is not completely closed, but the distance between our models and the individual architectures is always below 1%. These results are remarkable given the simplicity of our method, which makes no assumption about the optimal weights per task [161, 144], and the small parameter overhead reported in the "#Params" row (i.e. 1.17 for ResNet-50, 1.21 for DenseNet-121 and 1.16 for VGG-16), which represents the total number of parameters (counting all tasks and excluding the classifiers) relative to those of the baseline network 5.

Regarding the comparison with the other algorithms, our model consistently outperforms both the basic version of Piggyback and PackNet in all settings and architectures, with the exception of Sketch for the DenseNet and VGG-16 architectures and CUBS for VGG-16, where the performance is comparable to that of Piggyback. When task-specific BN parameters are introduced also for Piggyback (Tables 3.1 and 3.2), the gap is reduced, with comparable performance in some settings (e.g. CUBS) but still large gaps in others (e.g. Flowers, Stanford Cars and WikiArt). These results show that the advantages of our model are not only due to the additional BN parameters, but also to the more flexible and powerful affine transformation introduced.

This statement is further confirmed by the VGG-16 experiments in Table 3.3. For this network, where the standard Piggyback model is already able to fill the gap between the feature-extractor baseline and the individual architectures, our model achieves comparable or slightly superior performance (i.e. CUBS, WikiArt and Sketch). However, in the scenarios where Piggyback does not reach the performance of the independently fine-tuned models (i.e. Stanford Cars and Flowers), our model consistently outperforms it, either halving (Flowers) or removing (Stanford Cars) the remaining gap. Since this network contains no batch-normalization layers, this confirms the generality of our model, showing the advantages of both our simple and full versions even without task-specific BN layers.

As for the comparison with BA2, the performance of our model is comparable or superior in most settings. Remarkable are the gaps on the WikiArt dataset, with our full model surpassing BA2 by 3% with ResNet-50 and 4% with DenseNet-121. Although both Piggyback and BA2 use fewer parameters than our approach, our full model outperforms both of them in terms of the final score (Score row) and the ratio between score and parameters used (Score/Params row). This shows that our model makes the most effective use of the binary masks, achieving not only higher performance but also a more favorable trade-off with model size.


5 If the base architecture contains Np parameters and Ap additional bits are introduced per task, then #Params = 1 + Ap(T-1)/(32 Np), where T denotes the number of tasks (including the one used for pre-training the network) and the factor 32 accounts for the bits required by each real-valued parameter. The classifiers are not included in the computation.
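The footnote's bookkeeping is easy to verify numerically: with roughly one mask bit per weight (Ap ≈ Np) and T = 6 tasks, it yields 1 + 5/32 ≈ 1.16, in line with the overheads reported for the first benchmark (the function below is our own sketch):

```python
def params_overhead(n_p, a_p, t):
    """Relative parameter count of the footnote:
    #Params = 1 + Ap * (T - 1) / (32 * Np),
    with Np base parameters, Ap extra bits per task and T tasks
    (including the pre-training one); the factor 32 is the number of
    bits of a real-valued weight."""
    return 1.0 + a_p * (t - 1) / (32.0 * n_p)

# One mask bit per weight and 6 tasks: 1 + 5/32 = 1.15625
overhead = params_overhead(n_p=25.5e6, a_p=25.5e6, t=6)
```

The small deviations from 1.16 in the tables (e.g. 1.17, 1.21) come from the extra per-task scalars and BN parameters on top of the masks.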



Table 3.1. Accuracy of ResNet-50 architectures in the ImageNet-to-Sketch scenario.

Dataset       | Classifier Only [160] | PackNet ↓ [160] | PackNet ↑ [160] | Piggyback [160] | Piggyback BN | BA2 [17] | BAT Simple | BAT Full | Individual [160]
#Params       | 1    | 1.10 | 1.10 | 1.16 | 1.17 | 1.03 | 1.17 | 1.17 | 6
ImageNet      | 76.2 | 75.7 | 75.7 | 76.2 | 76.2 | 76.2 | 76.2 | 76.2 | 76.2
CUBS          | 70.7 | 80.4 | 71.4 | 80.4 | 82.1 | 81.2 | 82.6 | 82.4 | 82.8
Stanford Cars | 52.8 | 86.1 | 80.0 | 88.1 | 90.6 | 92.1 | 91.5 | 91.4 | 91.8
Flowers       | 86.0 | 93.0 | 90.6 | 93.5 | 95.2 | 95.7 | 96.5 | 96.7 | 96.6
WikiArt       | 55.6 | 69.4 | 70.3 | 73.4 | 74.1 | 72.3 | 74.8 | 75.3 | 75.6
Sketch        | 50.9 | 76.2 | 78.7 | 79.4 | 79.4 | 79.3 | 80.2 | 80.2 | 80.8
Score         | 533  | 732  | 620  | 934  | 1184 | 1265 | 1430 | 1458 | 1500
Score/Params  | 533  | 665  | 534  | 805  | 1012 | 1228 | 1222 | 1246 | 250

Table 3.2. Accuracy of DenseNet-121 architectures in the ImageNet-to-Sketch scenario.

Dataset       | Classifier Only [160] | PackNet ↓ [160] | PackNet ↑ [160] | Piggyback [160] | Piggyback BN | BA2 [17] | BAT Simple | BAT Full | Individual [160]
#Params       | 1    | 1.11 | 1.11 | 1.15 | 1.21 | 1.17 | 1.21 | 1.21 | 6
ImageNet      | 74.4 | 74.4 | 74.4 | 74.4 | 74.4 | 74.4 | 74.4 | 74.4 | 74.4
CUBS          | 73.5 | 80.7 | 69.6 | 79.7 | 81.4 | 82.4 | 81.5 | 81.7 | 81.9
Stanford Cars | 56.8 | 84.7 | 77.9 | 87.2 | 90.1 | 92.9 | 91.7 | 91.6 | 91.4
Flowers       | 83.4 | 91.1 | 91.5 | 94.3 | 95.5 | 96.0 | 96.7 | 96.9 | 96.5
WikiArt       | 54.9 | 66.3 | 69.2 | 72.0 | 73.9 | 71.5 | 75.5 | 75.7 | 76.4
Sketch        | 53.1 | 74.7 | 78.9 | 80.0 | 79.1 | 79.9 | 79.9 | 79.8 | 80.5
Score         | 324  | 685  | 607  | 946  | 1209 | 1434 | 1506 | 1534 | 1500
Score/Params  | 324  | 617  | 547  | 822  | 999  | 1226 | 1245 | 1268 | 250


Finally, Piggyback, BA2 and our model all outperform PackNet and, unlike the latter method, do not suffer from a heavy dependence on the ordering of the tasks. This advantage stems from a sequential multi-task learning strategy that is task independent, with the base network unaffected by the new tasks being learned.

Visual Decathlon Challenge. In this section we report the results obtained on the Visual Decathlon Challenge. We compare our model with the baseline method Piggyback [160] (PB), the budget-aware adapters of [17] (BA2), the improved version of the winning entry of the 2017 edition of the challenge [223] (DAN), the network with task-specific parallel adapters [215] (PA), the task-specific attention modules of [150] (MTAN), the covariance normalization approach [140] (CovNorm) and SpotTune [96]. We additionally report the baselines proposed by the authors of the challenge [214]. For the latter, we report the results of 5 models: the network used as a feature extractor (Feature), 10 different models fine-tuned on each single task (Fine-tune), the network with task-specific residual adapter modules [214] (RA), the same model with increased weight decay (RA-decay) and the same architecture jointly trained on all 10 tasks in a round-robin fashion (RA-joint). The first two models are considered as references. For the parallel adapters approach [215] we also report the version with a post-training low-rank decomposition of the adapters (PA-SVD), which extracts a task-specific and a task-agnostic component from the learned adapters, with the task-specific components further fine-tuned on each task. Additionally, we report the updated results of the residual adapters [214] as reported in

Table 3.3. Accuracy of VGG-16 architectures in the ImageNet-to-Sketch scenario.

Dataset       | Classifier Only [160] | PackNet ↓ [160] | PackNet ↑ [160] | Piggyback [160] | BAT Simple | BAT Full | Individual [160]
#Params       | 1    | 1.09 | 1.09 | 1.16 | 1.16 | 1.16 | 6
ImageNet      | 71.6 | 70.7 | 70.7 | 71.6 | 71.6 | 71.6 | 71.6
CUBS          | 63.5 | 77.7 | 70.3 | 77.8 | 77.4 | 77.4 | 77.4
Stanford Cars | 45.3 | 84.2 | 78.3 | 86.1 | 87.2 | 87.3 | 87.0
Flowers       | 80.6 | 89.7 | 89.8 | 90.7 | 91.6 | 91.5 | 92.3
WikiArt       | 50.5 | 67.2 | 68.5 | 71.2 | 71.6 | 71.9 | 67.7
Sketch        | 41.5 | 71.4 | 75.1 | 76.5 | 76.5 | 76.7 | 76.4
Score         | 342  | 1152 | 979  | 1441 | 1530 | 1538 | 1500
Score/Params  | 342  | 1057 | 898  | 1243 | 1319 | 1326 | 250


[215] (RA-N).


Similarly to [223], we tune the training schedule jointly for the 10 tasks using the validation set, and evaluate on the test set (via the challenge evaluation server) a model trained on the union of the training and validation sets with the validated schedule. As opposed to methods like [214], we use the same schedule for the 9 tasks (all except the baseline pre-trained on ImageNet), without adopting task-specific strategies for setting the hyperparameters. Moreover, we do not employ our algorithm while pre-training the ImageNet architecture, as done in [214]. For fairness, we additionally report the results obtained by our implementation of [160] using the same pre-trained model, training schedule and data augmentation as our algorithm (PB ours).

The results are reported in Table 3.4 in terms of the S-score (see Eq. (3.3)) and S_p. The first part of the table shows the baselines (i.e. the fine-tuned architectures and the network used as feature extractor), while the middle part shows the sequential learning models against which we compare. In the last part of the table we report, for fairness, the methods that do not follow a sequential learning setting, since they either train on all datasets jointly (RA-joint) or include a post-processing step involving all tasks (PA-SVD).

From the table we can see that the full version of our model (F) achieves very strong results, being the third-best method in terms of S-score, behind only CovNorm and SpotTune, and comparable to PA. However, SpotTune uses a large number of parameters (11×) and PA doubles the parameters of the original model. CovNorm uses very few parameters but requires a two-stage pipeline. Our model, on the other hand, requires neither a large number of parameters (as SpotTune and PA) nor a two-stage pipeline (as CovNorm), while achieving results close to the state of the art (215 points below CovNorm in S-score). Compared to binary-mask based approaches, our model surpasses Piggyback by more than 600 points, BA2 by 300 and BAT simple by more than 200. It is worth highlighting that these results have been achieved without task-specific hyperparameter tuning, differently from previous works, e.g. [214, 215, 140].

Analyzing the S_p score, BAT is the third-best model, behind BA2 and CovNorm. We highlight, however, that CovNorm requires a two-stage pipeline to reduce the number of parameters needed, while BA2 is explicitly designed to limit the budget (i.e. parameters, FLOPs) required by the model.

Table 3.4. Results in terms of S and Sp scores for the Visual Decathlon Challenge.

Method          | #Par | ImN  | Airc | C100 | DP   | DTD  | GTS  | Flw  | Ogl  | SVHN | UCF  | Score | Sp
Feature [214]   | 1    | 59.7 | 23.3 | 63.1 | 80.3 | 45.4 | 68.2 | 73.7 | 58.8 | 43.5 | 26.8 | 544   | 544
Fine-tune [214] | 10   | 59.9 | 60.3 | 82.1 | 92.8 | 55.5 | 97.5 | 81.4 | 87.7 | 96.6 | 51.2 | 2500  | 250
RA [214]        | 2    | 59.7 | 56.7 | 81.2 | 93.9 | 50.9 | 97.1 | 66.2 | 89.6 | 96.1 | 47.5 | 2118  | 1059
RA-decay [214]  | 2    | 59.7 | 61.9 | 81.2 | 93.9 | 57.1 | 97.6 | 81.7 | 89.6 | 96.1 | 50.1 | 2621  | 1311
RA-N [215]      | 2    | 60.3 | 61.9 | 81.2 | 93.9 | 57.1 | 99.3 | 81.7 | 89.6 | 96.6 | 50.1 | 3159  | 1580
DAN [223]       | 2.17 | 57.7 | 64.1 | 80.1 | 91.3 | 56.5 | 98.5 | 86.1 | 89.7 | 96.8 | 49.4 | 2852  | 1314
PA [215]        | 2    | 60.3 | 64.2 | 81.9 | 94.7 | 58.8 | 99.4 | 84.7 | 89.2 | 96.5 | 50.9 | 3412  | 1706
MTAN [150]      | 1.74 | 63.9 | 61.8 | 81.6 | 91.6 | 56.4 | 98.8 | 81.0 | 89.8 | 96.9 | 50.6 | 2941  | 1690
SpotTune [96]   | 11   | 60.3 | 63.9 | 80.5 | 96.5 | 57.1 | 99.5 | 85.2 | 88.8 | 96.7 | 52.3 | 3612  | 328
CovNorm [140]   | 1.25 | 60.4 | 69.4 | 81.3 | 98.8 | 60.0 | 99.1 | 83.4 | 87.7 | 96.6 | 48.9 | 3713  | 2970
PB [160]        | 1.28 | 57.7 | 65.3 | 79.9 | 97.0 | 57.5 | 97.3 | 79.1 | 87.6 | 97.2 | 47.5 | 2838  | 2217
PB ours         | 1.28 | 60.8 | 52.3 | 80.0 | 95.1 | 59.6 | 98.7 | 82.9 | 85.1 | 96.7 | 46.9 | 2805  | 2191
BA2 [17]        | 1.03 | 56.9 | 49.4 | 78.1 | 95.5 | 55.1 | 99.4 | 86.1 | 88.7 | 96.9 | 50.2 | 3199  | 3106
BAT (S) [171]   | 1.29 | 60.8 | 51.3 | 81.9 | 94.7 | 59.0 | 99.1 | 88.0 | 89.3 | 96.5 | 48.7 | 3263  | 2529
BAT (F)         | 1.29 | 60.8 | 52.8 | 82.0 | 96.2 | 58.7 | 99.2 | 88.2 | 89.2 | 96.8 | 48.6 | 3497  | 2711
PA-SVD [215]    | 1.5  | 60.3 | 66.0 | 81.9 | 94.2 | 57.8 | 99.2 | 85.7 | 89.3 | 96.6 | 52.5 | 3398  | 2265
RA-joint [214]  | 2    | 59.2 | 63.7 | 81.3 | 93.3 | 57.0 | 97.5 | 83.4 | 89.8 | 96.2 | 50.3 | 2643  | 1322


Ablation Study


In the following we analyze the impact of the various components of our model. In particular, we consider the impact of the parameters k0, k1, k2, k3 and of the surrogate function h~ on the final results of our model for the ResNet-50 and DenseNet-121 architectures in the ImageNet-to-Sketch scenario. Since these architectures contain batch-normalization layers, we set k0=1 for our simple and full versions and k0=0 when we analyze the special case [160]. For the other parameters we adopt various choices: either we fix them to a constant, in order to exclude their impact, or we train them, to assess their particular contribution to the model. The surrogate function we use is the identity h~(x)=x, unless otherwise stated (i.e. with Sigmoid). The results of our analysis are shown in Tables 3.5 and 3.6.

As the tables show, while the BN parameters boost the performance of Piggyback, adding k1 to the model provides no further gain. This does not hold for the simple version of our model: without k1, our model cannot fully exploit the binary masks, achieving comparable or even lower performance than the Piggyback model. We also notice a similar drop affecting our Simple version when the bias is omitted.

Noticeably, the full versions with k2=0 suffer a large decrease in performance in almost all settings (e.g. ResNet-50 Flowers drops from 96.7% to 91.0%), showing that the component bringing the largest benefit to our algorithm is the addition of the binary mask itself scaled by k2 (i.e. k2·M). This also explains why the simple version achieves performance similar to the full version of our model. We finally note the limited contribution of the standard Piggyback component (i.e. k1·W∘M) compared to the new components we introduced in the transformation: in fact, there is a clear drop in performance in various scenarios (e.g. CUBS, Cars) when we set either k1=0 or k2=0, highlighting the importance of those components. Consequently, once k1 is introduced in our Simple model, the boost in performance is significant enough that neither the inclusion of k3 nor channel-wise parameters k1 provide further gains. Slightly better results are achieved on larger datasets, such as WikiArt, where the additional parameters give more capacity to the model, better handling the larger amount of information available in the dataset.
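Assembling the components named in this ablation (the pretrained weights, the standard Piggyback term k1·W∘M, the scaled mask k2·M and a bias k3), the weight transformation can be sketched as below. The exact parametrization is defined in section 3.3.2, so this NumPy version is only one plausible reading, not the authoritative formula:

```python
import numpy as np

def bat_weights(w, m, k0=1.0, k1=0.0, k2=1.0, k3=0.0):
    """Sketch of an affine weight transformation built from the
    components discussed above: k0*W + k1*(W o M) + k2*M + k3,
    where M is the task-specific binary mask (assumed form)."""
    return k0 * w + k1 * (w * m) + k2 * m + k3

w = np.array([[0.5, -0.3], [0.2, 0.1]])
m = np.array([[1.0, 0.0], [0.0, 1.0]])  # binary mask

# Piggyback [160] recovered as the special case W o M:
piggyback = bat_weights(w, m, k0=0.0, k1=1.0, k2=0.0, k3=0.0)
```

Under this reading, ablating a component simply fixes the corresponding scalar to 0, which is what the rows of Tables 3.5 and 3.6 vary.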

Table 3.5. Impact of the parameters k0, k1, k2 and k3 of our model using the ResNet-50 architecture in the ImageNet-to-Sketch scenario. ✓ denotes a learned parameter, while * denotes [160] obtained as a special case of our model.

Method                    | k0 | k1 | k2 | k3 | CUBS | CARS | Flowers | WikiArt | Sketch
Piggyback [160]           | 0  | 0  | 0  | 1  | 80.4 | 88.1 | 93.6 | 73.4 | 79.4
Piggyback*                | 0  | 0  | 0  | 1  | 80.4 | 87.8 | 93.1 | 72.5 | 78.6
Piggyback* with BN        | 0  | 0  | 0  | 1  | 82.1 | 90.6 | 95.2 | 74.1 | 79.4
Piggyback* with BN        | 0  | ✓  | 0  | 1  | 81.9 | 89.9 | 94.8 | 73.7 | 79.9
BAT (Simple, no bias)     | 1  | 0  | ✓  | 0  | 80.8 | 90.3 | 96.1 | 73.5 | 80.0
BAT (Simple) [171]        | 1  | 0  | ✓  | ✓  | 82.6 | 91.5 | 96.5 | 74.8 | 80.2
BAT (Simple with Sigmoid) | 1  | 0  | ✓  | ✓  | 82.6 | 91.4 | 96.4 | 75.2 | 80.2
BAT (Full, no bias)       | 1  | ✓  | ✓  | 0  | 80.7 | 90.2 | 96.0 | 72.0 | 78.8
BAT (Full, no k2)         | 1  | ✓  | 0  | ✓  | 80.6 | 87.5 | 91.0 | 73.0 | 78.4
BAT (Full)                | 1  | ✓  | ✓  | ✓  | 82.4 | 91.4 | 96.7 | 75.3 | 80.2
BAT (Full with Sigmoid)   | 1  | ✓  | ✓  | ✓  | 82.7 | 91.4 | 96.6 | 75.2 | 80.2
BAT (Full, channel-wise)  | 1  | ✓  | ✓  | ✓  | 82.0 | 91.0 | 96.3 | 74.8 | 80.0

Table 3.6. Impact of the parameters k0, k1, k2 and k3 of our model using the DenseNet-121 architecture in the ImageNet-to-Sketch scenario. ✓ denotes a learned parameter, while * denotes [160] obtained as a special case of our model.

Method                    | k0 | k1 | k2 | k3 | CUBS | CARS | Flowers | WikiArt | Sketch
Piggyback [160]           | 0  | 0  | 0  | 1  | 79.7 | 87.2 | 94.3 | 72.0 | 80.0
Piggyback*                | 0  | 0  | 0  | 1  | 80.0 | 86.6 | 94.4 | 71.9 | 78.7
Piggyback* with BN        | 0  | 0  | 0  | 1  | 81.4 | 90.1 | 95.5 | 73.9 | 79.1
Piggyback* with BN        | 0  | ✓  | 0  | 1  | 81.9 | 90.1 | 95.4 | 72.6 | 79.9
BAT (Simple, no bias)     | 1  | 0  | ✓  | 0  | 80.4 | 91.4 | 96.7 | 75.0 | 79.7
BAT (Simple) [171]        | 1  | 0  | ✓  | ✓  | 81.5 | 91.7 | 96.7 | 75.5 | 79.9
BAT (Simple with Sigmoid) | 1  | 0  | ✓  | ✓  | 81.5 | 91.7 | 97.0 | 76.0 | 79.8
BAT (Full, no bias)       | 1  | ✓  | ✓  | 0  | 80.2 | 91.1 | 96.5 | 75.1 | 79.2
BAT (Full, no k2)         | 1  | ✓  | 0  | ✓  | 79.8 | 87.2 | 91.8 | 73.2 | 78.1
BAT (Full) [172]          | 1  | ✓  | ✓  | ✓  | 81.7 | 91.6 | 96.9 | 75.7 | 79.9
BAT (Full with Sigmoid)   | 1  | ✓  | ✓  | ✓  | 82.0 | 91.7 | 97.0 | 76.0 | 79.9
BAT (Full, channel-wise)  | 1  | ✓  | ✓  | ✓  | 81.4 | 91.6 | 96.5 | 75.5 | 79.9


As for the choice of the surrogate h~, no particular advantage was observed for h~(x)=σ(x) with respect to the standard straight-through estimator (h~(x)=x). This may be due to the noisy nature of the straight-through estimator, which has the positive effect of regularizing the parameters, as shown in previous works [16, 188].
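The two estimators compared above can be sketched as follows (a NumPy sketch with a hypothetical threshold at 0): in the forward pass the real-valued mask is hard-thresholded, while in the backward pass the zero-almost-everywhere gradient of the step function is replaced by the derivative of the surrogate h~:

```python
import numpy as np

def binarize_forward(r, threshold=0.0):
    """Forward pass: hard-threshold the real-valued mask r into {0, 1}."""
    return (r >= threshold).astype(float)

def binarize_backward(grad_out, r, surrogate="identity"):
    """Backward pass through the binarization.
    'identity' (h~(x) = x) is the straight-through estimator: the
    incoming gradient passes through unchanged. 'sigmoid'
    (h~(x) = sigma(x)) scales it by sigma(r) * (1 - sigma(r))."""
    if surrogate == "identity":
        return grad_out
    s = 1.0 / (1.0 + np.exp(-r))
    return grad_out * s * (1.0 - s)

m = binarize_forward(np.array([-0.5, 0.3]))  # -> [0., 1.]
```

With the identity surrogate, mask entries far from the threshold still receive full gradient, which is the source of the noise (and implicit regularization) mentioned above.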

We also note that for DenseNet-121, as opposed to ResNet-50, setting k1 to zero degrades performance on only 1 of the 5 datasets (i.e. CUBS), while the other 4 are unaffected, showing that the effectiveness of the different components of the model also depends on the architecture used.
Parameter Analysis We analyze the values of the parameters k1,k2 and k3 of one instance of our full model in the ImageNet-to-Sketch benchmark. We use both the architectures employed in that scenario (i.e. ResNet-50 and DenseNet-121) and we plot the values of k1,k2 and k3 as well as the percentage of 1 s present inside the binary masks for different layers of the architectures. Together with those values we report the percentage of 1s for the masks obtained through our implementation of Piggyback. Both the models have been trained considering task-specific batch-normalization parameters. The results are shown in Figures 3.3 and 3.4. In all scenarios our model keeps almost half of the masks active across the whole architecture. Compared to the masks obtained by Piggyback, there are 2 differences: 1) Piggyback exhibits denser masks (i.e. with a larger portion of 1s), 2) the density of the masks in Piggyback tends to decreases as the depth of the layer increases. Both these aspects may be linked to the nature of our model: by having more flexibility through the affine transformation adopted, there is less need to keep active large part of the network, since a loss of information can be recovered through the other components of the model, as well as constraining a particular part of the architecture. For what concerns the value of the parameters k1,k2 and k3 for both architectures k2 and k3 tend to have larger magnitudes with respect to k1 . Also,the values of k2 and k1 tend to have a different sign,which allows the term k11+k2M to span over positive and negative values. We also notice that the transformation of the weights are more prominent as the depth increases, which is intuitively explained by the fact that baseline network requires stronger adaptation to represent the higher-level concepts pertaining to different tasks. This is even more evident for WikiArt and Sketch due to the variability that these datasets contain with respect to standard natural images.
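As a sanity check on the curves of Figures 3.3 and 3.4, the per-layer mask density is straightforward to compute. The sketch below assumes, purely as a convention of this illustration, that the binary masks are obtained by thresholding real-valued latent mask parameters at zero; the function name and toy data are hypothetical.

```python
import numpy as np

def mask_density(real_valued_masks, threshold=0.0):
    """Percentage of 1s in the binarized mask of each layer.

    `real_valued_masks` is a list of arrays of latent mask parameters,
    one per layer; binarizing by thresholding at zero is an assumption
    of this sketch.
    """
    densities = []
    for m in real_valued_masks:
        binary = (m >= threshold)
        densities.append(100.0 * binary.mean())
    return densities

# Toy example: two "layers" of random zero-mean latent parameters,
# whose binarized masks are typically close to 50% dense.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 3, 3, 3)), rng.standard_normal((128, 64, 3, 3))]
print(mask_density(layers))
```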

3.3.5 Conclusions


This section presented a simple yet powerful method for sequentially learning new tasks, given a fixed, pre-trained deep architecture. In particular, we generalize previous work on multi-domain learning applying binary masks to the original weights of the network [160] by introducing an affine transformation that acts upon such weights and the masks themselves. Our generalization allows implementing a large variety of possible transformations, better adapting to the specific characteristics of each task. Experiments on two public benchmarks fully confirm the power of our approach, which fills the gap between binary-mask based methods and the state of the art on the Visual Decathlon Challenge.

Interesting future directions are extending this approach to several life-long learning scenarios (from incremental class learning to open-world recognition) and exploiting the relationship between different tasks through cross-task affine transformations, in order to let the model reuse knowledge obtained from different tasks.

While in this section we considered multi-domain learning, an inherently multi-head problem, single-head incremental learning scenarios are considered more challenging in the community, due to the more severe catastrophic forgetting they exhibit [36]. In the next section, we will study the problem of incremental class learning in semantic segmentation, a single-head problem mostly unexplored in the community.



Figure 3.3. Percentage of 1s in the binary masks at different layer depths for Piggyback (left) and our full model (center), and values of the parameters $k_1$, $k_2$, $k_3$ computed by our full model (right), for all datasets of the ImageNet-to-Sketch benchmark and the ResNet-50 architecture.



Figure 3.4. Percentage of 1s in the binary masks at different layer depths for Piggyback (left) and our full BAT model (center), and values of the parameters $k_1$, $k_2$, $k_3$ computed by our full model (right), for all datasets of the ImageNet-to-Sketch benchmark and the DenseNet-121 architecture.


3.4 Incremental Learning in Semantic Segmentation 6


In Section 3.3, we focused on the problem of multi-domain learning, where the goal is to equip a model to tackle multiple tasks at the same time. Both in our BAT approach and previous works [214,215,223,140] this is achieved by learning task-specific parameters which are included in the original pre-trained model. This kind of scenario falls into the multi-head incremental learning setting (i.e. one network per task/set of concepts) and it is considered to be an easier problem than the single-head counterpart [36]. In the single-head scenario, we have a unique model classifying all semantic concepts together and, since all concepts share the same output space, this makes the catastrophic forgetting problem more severe. In this section, we will describe a solution to a classical single-head scenario, i.e. incremental class learning, for an unexplored task: semantic segmentation.

Semantic segmentation is a fundamental problem in computer vision. In the last years, thanks to the emergence of deep neural networks and to the availability of large-scale human-annotated datasets [64,309] ,the state of the art has improved significantly [152,40,307,146,305] . Current approaches are derived by extending deep architectures from image-level to pixel-level classification, taking advantage of Fully Convolutional Networks (FCNs) [152]. Over the years, semantic segmentation models based on FCNs have been improved in several ways, e.g. by exploiting multiscale representations [146,305] ,modeling spatial dependencies and contextual cues [38,37,40] or considering attention models [39].

Still, existing semantic segmentation methods are not designed to incrementally update their inner classification model when new categories are discovered. While deep nets are undoubtedly powerful, it is well known that their capabilities in an incremental learning setting are limited [114]. In fact, deep architectures struggle in updating their parameters for learning new categories whilst preserving good performance on the old ones (catastrophic forgetting [175]).

As described in Section 3.2, the problem of incremental learning has been traditionally addressed in object recognition [144,118,36,216,106] and detection [240], but less attention has been devoted to semantic segmentation. Here we fill this gap, proposing an incremental class learning (ICL) approach for semantic segmentation. Inspired by previous methods on image classification [144,216,30], we cope with catastrophic forgetting by resorting to knowledge distillation [102]. However, we argue (and experimentally demonstrate) that a naive application of previous knowledge distillation strategies does not suffice in this setting. In fact, one peculiar aspect of semantic segmentation is the presence of a special class, the background class, indicating pixels not assigned to any of the given object categories. While the presence of this class marginally influences the design of traditional, offline semantic segmentation methods, this is not true in an incremental learning setting. As illustrated in Fig. 3.5, it is reasonable to assume that the semantics associated to the background class change over time. In other words, pixels associated to the background during a learning step may be assigned to a specific object class in subsequent steps or vice-versa, with the effect of exacerbating the catastrophic forgetting. To overcome this issue, we revisit the classical distillation-based framework for incremental learning [144] by introducing two novel loss terms that properly account for the semantic distribution shift within the background class, thus introducing the first ICL approach tailored to semantic segmentation. We name this method MiB (Modeling the Background for incremental learning in semantic segmentation). We extensively evaluate MiB on two datasets, Pascal-VOC [64] and ADE20K [309], showing that our approach, coupled with a novel classifier initialization strategy, largely outperforms traditional ICL methods.


6 F. Cermelli,M. Mancini,E. Ricci,B. Caputo. Modeling the Background for Incremental Learning in Semantic Segmentation. IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR) 2020.





Figure 3.5. Illustration of the semantic shift of the background class in incremental learning for semantic segmentation. Yellow boxes denote the ground truth provided in the learning step, while grey boxes denote classes not labeled. As different learning steps have different label spaces,at step t old classes (e.g. person) and unseen ones (e.g. car) might be labeled as background in the current ground truth. Here we show the specific case of single class learning steps, but we address the general case where an arbitrary number of classes is added.


To summarize, the contributions described in this section are as follows:

  • We study the task of incremental class learning for semantic segmentation, analyzing in particular the problem of distribution shift arising due to the presence of the background class.

  • We propose a new objective function and introduce a specific classifier initialization strategy to explicitly cope with the evolving semantics of the background class. We show that our approach greatly alleviates the catastrophic forgetting, leading to the state of the art.

  • We benchmark MiB over several previous ICL methods on two popular semantic segmentation datasets, considering different experimental settings. We hope that our results will serve as a reference for future works.



Figure 3.6. Overview of MiB . At learning step t an image is processed by the old (top) and current (bottom) models, mapping the image to their respective output spaces. As in standard ICL methods, we apply a cross-entropy loss to learn new classes (blue block) and a distillation loss to preserve old knowledge (yellow block). In this framework, we model the semantic changes of the background across different learning steps by (i) initializing the new classifier using the weights of the old background one (left), (ii) comparing the pixel-level background ground truth in the cross-entropy with the probability of having either the background (black) or an old class (pink and grey bars) and (iii) relating the background probability given by the old model in the distillation loss with the probability of having either the background or a novel class (green bar).


3.4.1 Problem Formulation


Before delving into the details of ICL for semantic segmentation, we first introduce the task of semantic segmentation. Let us denote as $\mathcal{X}$ the input space (i.e. the image space) and, without loss of generality, let us assume that each image $x \in \mathcal{X}$ is composed of a set of pixels $\mathcal{I}$ with constant cardinality $|\mathcal{I}| = N$. The output space is defined as $\mathcal{Y}^N$, i.e. the product set of $N$-tuples with elements in a label space $\mathcal{Y}$. Given an image $x$, the goal of semantic segmentation is to assign each pixel $x_i$ of $x$ a label $y_i \in \mathcal{Y}$ representing its semantic class. Out-of-class pixels can be assigned a special class, i.e. the background class $b \in \mathcal{Y}$. Given a training set $\mathcal{T} \subset \mathcal{X} \times \mathcal{Y}^N$, the mapping is realized by learning a model $f_\theta$ with parameters $\theta$ from the image space $\mathcal{X}$ to a pixel-wise class probability vector, i.e. $f_\theta : \mathcal{X} \to \mathbb{R}^{N \times |\mathcal{Y}|}$. The output segmentation mask is obtained as $\bar{y} = \{\arg\max_{c \in \mathcal{Y}} f_\theta(x)[i,c]\}_{i=1}^{N}$, where $f_\theta(x)[i,c]$ is the probability of class $c$ at pixel $i$.

In the ICL setting, training is realized over multiple phases, called learning steps, and each step introduces novel categories to be learnt. In other terms, during the $t$-th learning step, the previous label set $\mathcal{Y}^{t-1}$ is expanded with a set of new classes $\mathcal{C}^t$, yielding a new label set $\mathcal{Y}^t = \mathcal{Y}^{t-1} \cup \mathcal{C}^t$. Following the notation in Section 3.1, at learning step $t$ we are also provided with a training set $\mathcal{T}^t \subset \mathcal{X} \times (\mathcal{C}^t)^N$ that is used in conjunction with the previous model $f_{\theta^{t-1}} : \mathcal{X} \to \mathbb{R}^{N \times |\mathcal{Y}^{t-1}|}$ to train an updated model $f_{\theta^t} : \mathcal{X} \to \mathbb{R}^{N \times |\mathcal{Y}^t|}$. As in standard ICL, we assume the sets of labels $\mathcal{C}^t$ obtained at the different learning steps to be disjoint, except for the special void/background class $b$.
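The label-space bookkeeping of the ICL setting can be made concrete with a short sketch; the class names below are toy placeholders for illustration, not classes of any benchmark.

```python
# Y^t = Y^{t-1} ∪ C^t, with the per-step sets C^t pairwise disjoint
# except for the shared background class b.
background = "background"

steps = [
    {background, "person", "dog"},  # C^1 (toy classes)
    {background, "car"},            # C^2
]

label_sets = [{background}]  # Y^0: only the background is known at the start
for Ct in steps:
    # novel classes must be unseen so far, apart from the shared background
    assert label_sets[-1] & Ct == {background}
    label_sets.append(label_sets[-1] | Ct)

print(label_sets[-1])  # the final label space Y^2
```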

3.4.2 Modeling the Background for Incremental Learning in Se- mantic Segmentation


A naive approach to address the ICL problem consists in retraining the model fθt on each set Tt sequentially. When the predictor fθt is realized through a deep architecture, this corresponds to fine-tuning the network parameters on the training set Tt initialized with the parameters θt1 from the previous stage. This approach is simple,but it leads to catastrophic forgetting. Indeed,when training using Tt no samples from the previously seen object classes are provided. This biases the new predictor fθt towards the novel set of categories in Ct to the detriment of the classes from the previous sets. In the context of ICL for image-level classification, a standard way to address this issue is coupling the supervised loss on Tt with a regularization term, either taking into account the importance of each parameter for previous tasks [118,239] ,or by distilling the knowledge using the predictions of the old model fθt1[144,216,30] . We take inspiration from the latter solution to initialize the overall objective function of our problem. In particular, we minimize a loss function of the form:

(3.4) $\ell(\theta^t) = \frac{1}{|\mathcal{T}^t|} \sum_{(x,y) \in \mathcal{T}^t} \left( \ell_{ce}^{\theta^t}(x,y) + \lambda\, \ell_{kd}^{\theta^t}(x) \right)$

where $\ell_{ce}$ is a standard supervised loss (e.g. cross-entropy loss), $\ell_{kd}$ is the distillation loss and $\lambda > 0$ is a hyper-parameter balancing the importance of the two terms.

As stated in Sec. 3.4.1, differently from standard ICL settings considered for image classification problems, in semantic segmentation we have that two different label sets Cs and Cu share the common void/background class b . However,the distribution of the background class changes across different incremental steps. In fact,background annotations given in Tt refer to classes not present in Ct ,that might belong to the set of seen classes Yt1 and/or to still unseen classes i.e. Cu with u>t (see Fig. 3.5). In the following,we show how we account for the semantic shift in the distribution of the background class by revisiting standard choices for the general objective defined in Eq. (3.4).

Revisiting Cross-Entropy Loss. In Eq.(3.4), a possible choice for $\ell_{ce}$ is the standard cross-entropy loss computed over all image pixels:

(3.5) $\ell_{ce}^{\theta^t}(x,y) = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \log q_x^t(i, y_i),$

where $y_i \in \mathcal{Y}^t$ is the ground truth label associated to pixel $i$ and $q_x^t(i,c) = f_{\theta^t}(x)[i,c]$.

The problem with Eq.(3.5) is that the training set Tt we use to update the model only contains information about novel classes in Ct . However,the background class in Tt might include also pixels associated to the previously seen classes in Yt1 . Here we argue that, without explicitly taking into account this aspect, the catastrophic forgetting problem would be even more severe. In fact, we would drive our model to predict the background label b for pixels of old classes,further degrading the capability of the model to preserve semantic knowledge of past categories. To avoid this issue, we propose to modify the cross-entropy loss in Eq.(3.5) as follows:

(3.6) $\ell_{ce}^{\theta^t}(x,y) = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \log \tilde{q}_x^t(i, y_i),$

where:

(3.7) $\tilde{q}_x^t(i,c) = \begin{cases} q_x^t(i,c) & \text{if } c \neq b \\ \sum_{k \in \mathcal{Y}^{t-1}} q_x^t(i,k) & \text{if } c = b. \end{cases}$

Our intuition is that by using Eq.(3.6) we can update the model to predict the new classes and, at the same time, account for the uncertainty over the actual content of the background class. In fact, in Eq.(3.6) the background class ground truth is not directly compared with its probabilities qxt(i,b) obtained from the current model fθt ,but with the probability of having either an old class or the background,as predicted by fθt (Eq.(3.7)). A schematic representation of this procedure is depicted in Fig. 3.6 (blue block). It is worth noting that the alternative of ignoring the background pixels within the cross-entropy loss is a sub-optimal solution. In fact, this would not allow to adapt the background classifier to its semantic shift and to exploit the information that new images might contain about old classes.
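A minimal NumPy sketch of the modified cross-entropy of Eq.(3.6)-(3.7) may help to fix ideas. The index convention (background at row 0, old classes in the following rows) is an assumption of this illustration, not a prescription of the method.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mib_cross_entropy(logits, target, num_old):
    """Cross-entropy of Eq. (3.6)-(3.7), sketched in NumPy.

    logits: (C, N) scores of the current model over the full label space
            Y^t for N pixels; background at row 0, old classes at rows
            1..num_old (this ordering is an assumption of the sketch).
    target: (N,) ground truth containing only the background (0) and
            new-class indices (> num_old).
    """
    q = softmax(logits, axis=0)            # q_x^t(i, c)
    q_tilde = q.copy()
    # Eq. (3.7): the background ground truth is compared with the
    # probability of having either the background or an old class.
    q_tilde[0] = q[: num_old + 1].sum(axis=0)
    n = target.shape[0]
    return -np.log(q_tilde[target, np.arange(n)]).mean()
```

Setting `num_old = 0` recovers the standard cross-entropy of Eq.(3.5), since the aggregated background probability then reduces to $q_x^t(i,b)$.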

Revisiting Distillation Loss. In the context of incremental learning, the distillation loss [102] is a common strategy to transfer knowledge from the old model $f_{\theta^{t-1}}$ into the new one, preventing catastrophic forgetting. Formally, a standard choice for the distillation loss $\ell_{kd}$ is:

(3.8) $\ell_{kd}^{\theta^t}(x,y) = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{c \in \mathcal{Y}^{t-1}} q_x^{t-1}(i,c) \log \hat{q}_x^t(i,c),$

where $\hat{q}_x^t(i,c)$ is defined as the probability of class $c$ for pixel $i$ given by $f_{\theta^t}$, but re-normalized across all the classes in $\mathcal{Y}^{t-1}$, i.e.:

(3.9) $\hat{q}_x^t(i,c) = \begin{cases} 0 & \text{if } c \in \mathcal{C}^t \setminus \{b\} \\ q_x^t(i,c) \,/\, \sum_{k \in \mathcal{Y}^{t-1}} q_x^t(i,k) & \text{if } c \in \mathcal{Y}^{t-1}. \end{cases}$

The rationale behind $\ell_{kd}$ is that $f_{\theta^t}$ should produce activations close to the ones produced by $f_{\theta^{t-1}}$. This regularizes the training procedure in such a way that the parameters $\theta^t$ remain anchored to the solution found for recognizing pixels of the previous classes, i.e. $\theta^{t-1}$.

The loss defined in Eq.(3.8) has been used either in its base form or in variants in different contexts, from incremental task [144] and class learning [216,30] in object classification to complex scenarios such as detection [240] and segmentation [178]. Despite its success, it has a fundamental drawback in semantic segmentation: it completely ignores the fact that the background class is shared among different learning steps. While with Eq.(3.6) we tackled the first problem linked to the semantic shift of the background (i.e. the background in $\mathcal{T}^t$ contains pixels of classes in $\mathcal{Y}^{t-1}$), we use the distillation loss to tackle the second: annotations for the background in $\mathcal{T}^s$ with $s < t$ might include pixels of classes in $\mathcal{C}^t$.

From the latter considerations, the background probabilities assigned to a pixel by the old predictor fθt1 and by the current model fθt do not share the same semantic content. More importantly, fθt1 might predict as background pixels of classes in Ct that we are currently trying to learn. Notice that this aspect is peculiar to the segmentation task and it is not considered in previous incremental learning models. However, in our setting we must explicitly take it into account to perform a correct distillation of the old model into the new one. To this extent we define our novel distillation loss by rewriting q^xt(i,c) in Eq. (3.9) as:
(3.10) $\hat{q}_x^t(i,c) = \begin{cases} q_x^t(i,c) & \text{if } c \neq b \\ \sum_{k \in \mathcal{C}^t} q_x^t(i,k) & \text{if } c = b. \end{cases}$

Similarly to Eq.(3.8), we still compare the probability of a pixel belonging to seen classes assigned by the old model, with its counterpart computed with the current parameters θt . However,differently from classical distillation,in Eq.(3.10) the probabilities obtained with the current model are kept unaltered, i.e. normalized across the whole label space Yt and not with respect to the subset Yt1 (Eq.(3.9)). More importantly,the background class probability as given by fθt1 is not directly compared with its counterpart in fθt ,but with the probability of having either a new class or the background,as predicted by fθt (see Fig. 3.6,yellow block).

We highlight that, with respect to Eq.(3.9) and other simple choices (e.g. excluding b from Eq.(3.9)) this solution has two advantages. First,we can still use the full output space of the old model to distill knowledge in the current one, without any constraint on pixels and classes. Second, we can propagate the uncertainty we have on the semantic content of the background in fθt1 without penalizing the probabilities of new classes we are learning in the current step t .
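The resulting distillation term can be sketched analogously; again, the row ordering (background first, then old classes, then new ones) is an assumption of this illustration.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mib_distillation(logits_old, logits_new, num_old):
    """Distillation of Eq. (3.8) with the q-hat of Eq. (3.10), in NumPy.

    logits_old: (1 + num_old, N) outputs of the old model over Y^{t-1}.
    logits_new: (C, N) outputs of the current model over the full Y^t.
    Background at row 0, old classes at rows 1..num_old (an index
    convention of this sketch, not of the thesis).
    """
    q_old = softmax(logits_old)   # q_x^{t-1}(i, c), the distillation targets
    q_new = softmax(logits_new)   # q_x^t(i, c), normalized over the whole Y^t
    q_hat = q_new[: num_old + 1].copy()
    # Eq. (3.10): the old background probability is compared with the
    # probability of having either the background or any new class in C^t.
    q_hat[0] = q_new[0] + q_new[num_old + 1:].sum(axis=0)
    return -(q_old * np.log(q_hat)).sum(axis=0).mean()
```

When no new classes are present, $\hat{q}$ coincides with $q$ and the loss falls back to the standard distillation of Eq.(3.8).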

Classifiers' Parameters Initialization. As discussed above, the background class b is a special class devoted to collect the probability that a pixel belongs to an unknown object class. In practice,at each learning step t ,the novel categories in Ct are unknowns for the old classifier fθt1 . As a consequence,unless the appearance of a class in Ct is very similar to one in Yt1 ,it is reasonable to assume that fθt1 will likely assign pixels of Ct to b. Taking into account this initial bias on the predictions of fθt on pixels of Ct ,it is detrimental to randomly initialize the classifiers for the novel classes. In fact a random initialization would provoke a misalignment among the features extracted by the model (aligned with the background classifier) and the random parameters of the classifier itself. Notice that this could lead to possible training instabilities while learning novel classes since the network could initially assign high probabilities for pixels in Ct to b.

To address this issue, we propose to initialize the classifier's parameters for the novel classes in such a way that, given an image $x$ and a pixel $i$, the probability of the background $q_x^{t-1}(i,b)$ is uniformly spread among the classes in $\mathcal{C}^t$, i.e. $q_x^t(i,c) = q_x^{t-1}(i,b)/|\mathcal{C}^t| \;\; \forall c \in \mathcal{C}^t$, where $|\mathcal{C}^t|$ is the number of new classes (notice that $b \in \mathcal{C}^t$). To this extent, let us consider a standard fully connected classifier and let us denote as $\{\omega_c^t, \beta_c^t\} \subset \theta^t$ the classifier parameters for a class $c$ at learning step $t$, with $\omega$ and $\beta$ denoting its weights and bias respectively. We can initialize $\{\omega_c^t, \beta_c^t\}$ as follows:

(3.11) $\omega_c^t = \begin{cases} \omega_b^{t-1} & \text{if } c \in \mathcal{C}^t \\ \omega_c^{t-1} & \text{otherwise} \end{cases}$

(3.12) $\beta_c^t = \begin{cases} \beta_b^{t-1} - \log(|\mathcal{C}^t|) & \text{if } c \in \mathcal{C}^t \\ \beta_c^{t-1} & \text{otherwise} \end{cases}$

where $\{\omega_b^{t-1}, \beta_b^{t-1}\}$ are the weights and bias of the background classifier at the previous learning step. The fact that the initialization defined in Eq.(3.11) and (3.12) leads to $q_x^t(i,c) = q_x^{t-1}(i,b)/|\mathcal{C}^t| \;\; \forall c \in \mathcal{C}^t$ is easy to obtain from $q_x^t(i,c) \propto \exp(\omega_c^t \cdot x + \beta_c^t)$.
As we will show in the experimental analysis, this simple initialization procedure brings benefits in terms of both improving the learning stability of the model and the final results, since it eases the role of the supervision imposed by Eq.(3.6) while learning new classes and follows the same principles used to derive our distillation loss (Eq.(3.10)).
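The initialization of Eq.(3.11)-(3.12) amounts to a few array operations; the sketch below assumes the background occupies the first row of the classifier and the new classes are appended last.

```python
import numpy as np

def init_classifier(w_old, b_old, num_new):
    """Classifier initialization of Eq. (3.11)-(3.12), sketched in NumPy.

    w_old: (|Y^{t-1}|, D) old classifier weights, background row first.
    b_old: (|Y^{t-1}|,)  old classifier biases.
    num_new: number of genuinely new classes (C^t also contains b).
    The row ordering is an assumption of this sketch.
    """
    size_ct = num_new + 1  # |C^t|, counting the background as well
    # Eq. (3.11): every class in C^t copies the old background weights.
    w = np.vstack([w_old, np.tile(w_old[0], (num_new, 1))])
    # Eq. (3.12): every class in C^t takes the old background bias
    # shifted by -log|C^t|, splitting q^{t-1}(i, b) uniformly over C^t.
    b = np.concatenate([b_old, np.full(num_new, b_old[0])])
    b[0] -= np.log(size_ct)
    b[len(b_old):] -= np.log(size_ct)
    return w, b
```

A quick numerical check confirms the property derived above: after initialization the old background probability is split uniformly over $\mathcal{C}^t$, while the probabilities of the old classes are unchanged.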

3.4.3 Experimental results


ICL Baselines


We compare MiB against standard ICL baselines on the considered segmentation task; since these baselines were originally designed for classification, segmentation is treated as a pixel-level classification problem. Specifically, we report the results of six different regularization-based methods: three prior-focused and three data-focused.

In the first category, we chose Elastic Weight Consolidation (EWC) [118], Path Integral (PI) [300], and Riemannian Walks (RW) [36]. They employ different strategies to compute the importance of each parameter for old classes: EWC uses the empirical Fisher matrix, PI uses the learning trajectory, while RW combines EWC and PI in a unique model. We choose EWC since it is a standard baseline employed also in [240] and PI and RW since they are two simple applications of the same principle. Since these methods act at the parameter level, to adapt them to the segmentation task we keep the loss in the output space unaltered (i.e. standard cross-entropy across the whole segmentation mask), computing the parameters' importance by considering their effect on learning old classes.

For the data-focused methods, we chose Learning without forgetting (LwF) [144], LwF multi-class (LwF-MC) [216] and the segmentation method of [178] (ILT). We denote as LwF the original distillation based objective as implemented in Eq.(3.4) with basic cross-entropy and distillation losses, which is the same as [144] except that distillation and cross-entropy share the same label space and classifier. LwF-MC is the single-head version of [144] as adapted from [216]. It is based on multiple binary classifiers, with the target labels defined using the ground truth for novel classes (i.e. Ct ) and the probabilities given by the old model for the old ones (i.e. Yt1 ). Since the background class is both in Ct and Yt1 we implement LwF-MC by a weighted combination of two binary cross-entropy losses, on both the ground truth and the probabilities given by fθt1 . Finally,ILT [178] is the only method specifically proposed for ICL in semantic segmentation. It uses a distillation loss in the output space, as in our adapted version of LwF [144] and/or another distillation loss in the features space, attached to the output of the network decoder. Here, we use the variant where both losses are employed. As done by [240], we do not compare with replay-based methods (e.g. [216]) since they violate the standard ICL assumption regarding the unavailability of old data.

In all tables we report two other baselines: simple fine-tuning (FT) on each $\mathcal{T}^t$ (e.g. Eq.(3.5)) and training on all classes offline (Joint). The latter can be regarded as an upper bound. All results are reported as mean Intersection-over-Union (mIoU) in percentage, averaged over all the classes of a learning step and over all the steps.
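For reference, the mIoU metric used in all tables can be sketched as a per-class intersection-over-union averaged over the classes. This is a minimal version computed on raw label maps; real evaluations accumulate a confusion matrix over the whole validation set.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union (in %), sketched in NumPy.

    pred, gt: integer label maps of the same shape.
    Classes absent from both maps are skipped in the average.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return 100.0 * float(np.mean(ious))
```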

Implementation Details


For all methods we use the Deeplab-v3 architecture [38] with a ResNet-101 [98] backbone and output stride of 16. Since memory requirements are an important issue in semantic segmentation, we use in-place activated batch normalization, as proposed in [224]. The backbone has been initialized using the ImageNet pre-trained model [224]. We follow [38], training the network with SGD and the same learning rate policy, momentum and weight decay. We use an initial learning rate of $10^{-2}$ for the first learning step and $10^{-3}$ for the following ones, as in [240]. In every learning step, we train the model with a batch size of 24 for 30 epochs on Pascal-VOC 2012 and for 60 epochs on ADE20K. We apply the same data augmentation of [38] and we crop the images to 512×512 during both training and test. To set the hyper-parameters of each method, we use the incremental learning protocol defined in [49], using 20% of the training set as validation. The final results are reported on the standard validation set of each dataset.
对于所有方法,我们使用带有ResNet - 101 [98]主干网络且输出步幅为16的Deeplab - v3架构[38]。由于内存需求是语义分割中的一个重要问题,我们采用了文献[224]中提出的原地激活批量归一化方法。主干网络使用ImageNet预训练模型[224]进行初始化。我们遵循文献[38],使用随机梯度下降法(SGD)以及相同的学习率策略、动量和权重衰减来训练网络。与文献[240]一样,我们在第一个学习步骤使用102的初始学习率,在后续步骤使用103的学习率。在每个学习步骤中,对于Pascal - VOC 2012数据集,我们以24的批量大小训练模型30个周期;对于ADE20K数据集,训练60个周期。我们采用与文献[38]相同的数据增强方法,并且在训练和测试期间将图像裁剪为512×512。为了设置每种方法的超参数,我们使用文献[49]中定义的增量学习协议,将训练集的20%用作验证集。最终结果在数据集的标准验证集上报告。
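For concreteness, the learning rate policy of [38] is the "poly" schedule, which decays the rate polynomially to zero over training. A minimal sketch follows; the function name is illustrative, and the power 0.9 is the value used in [38].

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """'Poly' learning-rate policy of Deeplab [38]: the rate decays
    from base_lr down to 0 over max_steps iterations."""
    return base_lr * (1 - step / max_steps) ** power

# e.g. the first incremental step uses base_lr = 1e-2, later steps 1e-3
assert poly_lr(1e-2, 0, 1000) == 1e-2     # start of training
assert poly_lr(1e-2, 1000, 1000) == 0.0   # end of training
```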

Pascal-VOC 2012


PASCAL-VOC 2012 [64] is a widely used benchmark that includes 20 foreground object classes. Following [178, 240], we define two experimental settings, depending on how we sample images to build the incremental datasets. Following [178], we define an experimental protocol called the disjoint setup: each learning step contains a unique set of images, whose pixels belong to classes seen either in the current or in the previous learning steps. Differently from [178], at each step we assume to have labels only for pixels of novel classes, while the old ones are labeled as background in the ground truth. The second setup, which we denote as overlapped, follows what has been done in [240] for detection: each training step contains all the images that have at least one pixel of a novel class, with only the latter annotated. It is important to note a difference with respect to the previous setup: images may now contain pixels of classes that we will learn in the future, but labeled as background. This is a more realistic setup since it does not make any assumption on the objects present in the images.

Following previous works [240, 178], we perform three different experiments concerning the addition of one class (19-1), five classes all at once (15-5), and five classes sequentially (15-1), following the alphabetical order of the classes to split the content of each learning step.

Addition of one class (19-1). In this experiment, we perform two learning steps: the first in which we observe the first 19 classes, and the second where we learn the tv-monitor class. Results are reported in Table 3.7 for the disjoint scenario and in Table 3.8 for the overlapped one. Without employing any regularization strategy, the performance on past classes drops significantly. FT, in fact, performs poorly, completely forgetting the first 19 classes. Unexpectedly, using PI as a regularization strategy does not provide benefits, while EWC and RW improve performance by nearly 15%. However, prior-focused strategies are not competitive with data-focused ones. In fact, LwF, LwF-MC, and ILT outperform them by a large margin, confirming the effectiveness of this approach in preventing catastrophic forgetting. While ILT surpasses standard ICL baselines, our model is able to obtain a further boost. This improvement is remarkable for new classes, where we gain 11% in mIoU, while not experiencing forgetting on old classes. It is especially interesting to compare MiB with the baseline LwF, which uses the same principles as our method but without modeling the background. Compared to LwF we achieve an average improvement of about 15%, thus demonstrating the importance of modeling the background in ICL for semantic segmentation. These results are consistent in both the disjoint and overlapped scenarios.

Table 3.7. Mean IoU on the Pascal-VOC 2012 dataset for the disjoint incremental class learning scenarios.

Method        | 19-1 (1-19 / 20 / all) | 15-5 (1-15 / 16-20 / all) | 15-1 (1-15 / 16-20 / all)
FT            | 5.8 / 12.3 / 6.2       | 1.1 / 33.6 / 9.2          | 0.2 / 1.8 / 0.6
PI [300]      | 5.4 / 14.1 / 5.9       | 1.3 / 34.1 / 9.5          | 0.0 / 1.8 / 0.4
EWC [118]     | 23.2 / 16.0 / 22.9     | 26.7 / 37.7 / 29.4        | 0.3 / 4.3 / 1.3
RW [36]       | 19.4 / 15.7 / 19.2     | 17.9 / 36.9 / 22.7        | 0.2 / 5.4 / 1.5
LwF [144]     | 53.0 / 9.1 / 50.8      | 58.4 / 37.4 / 53.1        | 0.8 / 3.6 / 1.5
LwF-MC [216]  | 63.0 / 13.2 / 60.5     | 67.2 / 41.2 / 60.7        | 4.5 / 7.0 / 5.2
ILT [178]     | 69.1 / 16.4 / 66.4     | 63.2 / 39.5 / 57.3        | 3.7 / 5.7 / 4.2
MiB           | 69.6 / 25.6 / 67.4     | 71.8 / 43.3 / 64.7        | 46.2 / 12.9 / 37.9
Joint         | 77.4 / 78.0 / 77.4     | 79.1 / 72.6 / 77.4        | 79.1 / 72.6 / 77.4

Table 3.8. Mean IoU on the Pascal-VOC 2012 dataset for the overlapped incremental class learning scenario.

Method        | 19-1 (1-19 / 20 / all) | 15-5 (1-15 / 16-20 / all) | 15-1 (1-15 / 16-20 / all)
FT            | 6.8 / 12.9 / 7.1       | 2.1 / 33.1 / 9.8          | 0.2 / 1.8 / 0.6
PI [300]      | 7.5 / 14.0 / 7.8       | 1.6 / 33.3 / 9.5          | 0.0 / 1.8 / 0.5
EWC [118]     | 26.9 / 14.0 / 26.3     | 24.3 / 35.5 / 27.1        | 0.3 / 4.3 / 1.3
RW [36]       | 23.3 / 14.2 / 22.9     | 16.6 / 34.9 / 21.2        | 0.0 / 5.2 / 1.3
LwF [144]     | 51.2 / 8.5 / 49.1      | 58.9 / 36.6 / 53.3        | 1.0 / 3.9 / 1.8
LwF-MC [216]  | 64.4 / 13.3 / 61.9     | 58.1 / 35.0 / 52.3        | 6.4 / 8.4 / 6.9
ILT [178]     | 67.1 / 12.3 / 64.4     | 66.3 / 40.6 / 59.9        | 4.9 / 7.8 / 5.7
MiB           | 70.2 / 22.1 / 67.8     | 75.5 / 49.4 / 69.0        | 35.1 / 13.5 / 29.7
Joint         | 77.4 / 78.0 / 77.4     | 79.1 / 72.6 / 77.4        | 79.1 / 72.6 / 77.4


Single-step addition of five classes (15-5). In this setting we add, after the first training set, the following classes: plant, sheep, sofa, train, tv-monitor. As before, results are reported in Table 3.7 for the disjoint scenario and in Table 3.8 for the overlapped one. Overall, the behavior on the first 15 classes is consistent with the 19-1 setting: FT and PI suffer a large performance drop, data-focused strategies (LwF, LwF-MC, ILT) outperform EWC and RW by far, while MiB gets the best results, obtaining performances closer to the joint training upper bound. Concerning the disjoint scenario, our method improves over the best baseline by 4.6% on old classes, by 2% on novel ones and by 4% on all classes. These gaps increase in the overlapped setting, where MiB surpasses the baselines by nearly 10% in all cases, clearly demonstrating its ability to take advantage of the information contained in the background class.

Table 3.9. Ablation study of the proposed method on the Pascal-VOC 2012 overlapped setup. CE and KD denote our cross-entropy and distillation losses, while init denotes our initialization strategy.

Method     | 19-1 (1-19 / 20 / all) | 15-5 (1-15 / 16-20 / all) | 15-1 (1-15 / 16-20 / all)
LwF [144]  | 51.2 / 8.5 / 49.1      | 58.9 / 36.6 / 53.3        | 1.0 / 3.9 / 1.8
+CE        | 57.6 / 9.9 / 55.2      | 63.2 / 38.1 / 57.0        | 12.0 / 3.7 / 9.9
+KD        | 66.0 / 11.9 / 63.3     | 72.9 / 46.3 / 66.3        | 34.8 / 4.5 / 27.2
+init      | 70.2 / 22.1 / 67.8     | 75.5 / 49.4 / 69.0        | 35.1 / 13.5 / 29.7


Multi-step addition of five classes (15-1). This setting is similar to the previous one except that the last 5 classes are learned sequentially, one by one. From Table 3.7 and Table 3.8, we can observe that performing multiple steps is challenging and existing methods work poorly in this setting, reaching performances inferior to 7% on both old and new classes. In particular, FT and prior-focused methods are unable to prevent forgetting, biasing their prediction completely towards new classes and demonstrating performances close to 0% on the first 15 classes. Even data-focused methods suffer a dramatic loss in performance in this setting, decreasing their score on all classes by more than 50% from the single- to the multi-step scenario. On the other hand, MiB is still able to achieve good performances. Compared to the other approaches, MiB outperforms all baselines by a large margin on both old (46.2% on the disjoint and 35.1% on the overlapped setup) and new (nearly 13% on both setups) classes. As the overall performance drop (11% on all classes) shows, the overlapped scenario is the most challenging one since it does not impose any constraint on which classes are present in the background.

Ablation Study. In Table 3.9 we report a detailed analysis of our contributions, considering the overlapped setup. We start from the baseline LwF [144], which employs standard cross-entropy and distillation losses. We first add to the baseline our modified cross-entropy (CE): this increases the ability to preserve old knowledge in all settings without harming (15-1) or even improving (19-1, 15-5) performances on the new classes. Second, we add our distillation loss (KD) to the model. Our KD provides a boost in performance for both old and new classes. The improvement on old classes is remarkable, especially in the 15-1 scenario (i.e. 22.8%). For the novel classes, the improvement is constant and is especially pronounced in the 15-5 scenario (7%). Notice that this aspect is peculiar to our KD, since standard formulations only aim at preserving old knowledge. This shows that the two losses provide mutual benefits. Finally, we add our classifiers' initialization strategy (init). This component provides an improvement in every setting, especially on novel classes: it doubles the performance on the 19-1 setting (22.1% vs 11.9%) and triples it on the 15-1 (13.5% vs 4.5%). This confirms the importance of accounting for the background shift at the initialization stage to facilitate the learning of new classes.

Table 3.10. Mean IoU on the ADE20K dataset for different incremental class learning scenarios, adding 50 classes at each step.

Method        | 100-50 (1-100 / 101-150 / all) | 50-50 (1-50 / 51-100 / 101-150 / all)
FT            | 0.0 / 24.9 / 8.3               | 0.0 / 0.0 / 22.0 / 7.3
LwF [144]     | 21.1 / 25.6 / 22.6             | 5.7 / 12.9 / 22.8 / 13.9
LwF-MC [216]  | 34.2 / 10.5 / 26.3             | 27.8 / 7.0 / 10.4 / 15.1
ILT [178]     | 22.9 / 18.9 / 21.6             | 8.4 / 9.7 / 14.3 / 10.8
MiB           | 37.9 / 27.9 / 34.6             | 35.5 / 22.2 / 23.6 / 27.0
Joint         | 44.3 / 28.2 / 38.9             | 51.1 / 38.3 / 28.2 / 38.9


Table 3.11. Mean IoU on the ADE20K dataset for a multi-step incremental class learning scenario, adding 50 classes in 5 steps.


100-10

Method        | 1-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | all
FT            | 0.0   | 0.0     | 0.0     | 0.0     | 0.0     | 16.6    | 1.1
LwF [144]     | 0.1   | 0.0     | 0.4     | 2.6     | 4.6     | 16.9    | 1.7
LwF-MC [216]  | 18.7  | 2.5     | 8.7     | 4.1     | 6.5     | 5.1     | 14.3
ILT [178]     | 0.3   | 0.0     | 1.0     | 2.1     | 4.6     | 10.7    | 1.4
MiB           | 31.8  | 10.4    | 14.8    | 12.8    | 13.6    | 18.7    | 25.9
Joint         | 44.3  | 26.1    | 42.8    | 26.7    | 28.1    | 17.3    | 38.9


ADE20K


ADE20K [309] is a large-scale dataset that contains 150 classes. Differently from Pascal-VOC 2012, this dataset contains both stuff (e.g. sky, building, wall) and object classes. We create the incremental datasets $\mathcal{T}^t$ by splitting the whole dataset into disjoint image sets, without any constraint except ensuring a minimum number of images (i.e. 50) where the classes in $\mathcal{C}^t$ have labeled pixels. Obviously, each $\mathcal{T}^t$ provides annotations only for classes in $\mathcal{C}^t$, while other classes (old or future) appear as background in the ground truth. In Table 3.10 and Table 3.11 we report the mean IoU obtained by averaging the results over two different class orders: the order proposed by [309] and a random one. In these experiments, we compare MiB with data-focused methods only (i.e. LwF, LwF-MC, and ILT), due to their gap in performance with respect to prior-focused ones.

Single-step addition of 50 classes (100-50). In the first experiment, we initially train the network on 100 classes and we add the remaining 50 all at once. From Table 3.10 we can observe that FT is clearly a bad strategy in large-scale settings, since it completely forgets old knowledge. Using a distillation strategy enables the network to reduce catastrophic forgetting: LwF obtains 21.1% on past classes, ILT 22.9%, and LwF-MC 34.2%. Regarding new classes, LwF is the best strategy, exceeding LwF-MC by 18.9% and ILT by 6.6%. However, MiB is far superior to all others, improving on both the first classes and the new ones. Moreover, we can observe that we are close to the joint training upper bound, especially considering new classes, where the gap with respect to it is only 0.3%. In Figure 3.7 we report some qualitative results which demonstrate the superiority of MiB compared to the baselines.


Figure 3.7. Qualitative results on the 100-50 setting of the ADE20K dataset using different incremental methods. The image demonstrates the superiority of our approach on both new (e.g. building, floor, table) and old (e.g. car, wall, person) classes. From left to right: image, FT, LwF [144], ILT [178], LwF-MC [216], MiB , and the ground-truth. Best viewed in color.


Multi-step addition of 50 classes (100-10). We then evaluate the performance on multiple incremental steps: we start from 100 classes and we add the remaining classes 10 by 10, resulting in 5 incremental steps. In Table 3.11 we report the results on all sets of classes after the last learning step. In this setting the performances of FT, LwF and ILT are very poor because they strongly suffer from catastrophic forgetting. LwF-MC demonstrates a better ability to preserve knowledge of old classes, at the cost of a performance drop on new classes. Again, MiB achieves the best trade-off between learning new classes and preserving past knowledge, outperforming LwF-MC by 11.6% considering all classes.

Three steps of 50 classes (50-50). Finally, in Table 3.10 we also analyze the performance on three sequential steps of 50 classes. Previous ICL methods achieve different trade-offs between learning new classes and not forgetting old ones. LwF and ILT obtain a good score on new classes, but they forget old knowledge. On the contrary, LwF-MC preserves knowledge of the first 50 classes without being able to learn new ones. MiB outperforms all the baselines by a large margin, with a gap of 11.9% over the best performing baseline, achieving the highest mIoU at every step. Remarkably, the highest gap is on the intermediate step, where there are classes that we must both learn incrementally and preserve from forgetting in the subsequent learning step.

3.4.4 Conclusions


In this section, we studied the incremental class learning problem for semantic segmentation, analyzing the realistic scenario where the new training set does not provide annotations for old classes, leading to the semantic shift of the background class and exacerbating the catastrophic forgetting problem. We addressed this issue by proposing a novel objective function and a classifiers' initialization strategy which allow our network to explicitly model the semantic shift of the background, effectively learning new classes without deteriorating its ability to recognize old ones. Results show that MiB outperforms regularization-based ICL methods by a large margin, on both small and large scale datasets. We believe that our problem formulation, our approach and our extensive comparison with previous methods will encourage future works on this novel research topic, especially in the direction of effectively including the semantic shift of the background class in ICL models for semantic segmentation.

In Sections 3.3 and 3.4, we focused on the multi-domain and incremental learning problems respectively, incrementally adding new semantic tasks/concepts to a pre-trained model. However, in both these tasks, the underlying assumption is that the images will contain only objects we have seen during training or that we can safely consider as background. A more realistic problem is equipping models with the ability not only to recognize semantic concepts and incrementally learn new ones, but also to detect whether an image contains a previously unseen semantic category. In the next section, we will show how we can address this problem in the framework of open-world recognition.

3.5 Open World Recognition


In the previous sections, we have discussed how new knowledge in terms of classification tasks (Section 3.3) and semantic concepts (Section 3.4) can be added to a pre-trained model. In particular, in Section 3.4, we showed how it is possible to have a model whose output space contains all the concepts incrementally learned by the model. However, all the models discussed so far rely on a simple assumption: all the categories we are interested in recognizing are contained in our output space. This closed-world assumption (CWA) is unrealistic for agents acting in the real world. Indeed, it is impossible to capture all existing semantic concepts in a single training set unless we are in a very constrained scenario. In this section, we take a step forward and show how we can break the CWA by developing two visual systems able to work in the open world.

To clarify our goal, let us consider the example shown in Fig. 3.8. The robot has a knowledge base composed of a limited number of classes. Given an image containing an unknown concept (e.g. banana), we want the robot to detect it as unknown and be able to add it to its knowledge base in subsequent learning stages. To accomplish this goal, it is very important for a robot vision system to have two crucial abilities: (i) it must be able to recognize already seen concepts and detect unknown ones (i.e. open set recognition), and (ii) it must be able to extend its knowledge base with new classes (i.e. incremental learning), without forgetting the already learned ones and without access to old training sets (i.e. avoiding catastrophic forgetting [175]). While open set recognition [234, 70, 136] and incremental learning [216, 26, 25, 263] are well-studied problems in the literature, few works have proposed a solution to solve them together [15, 50]. Standard approaches for open world recognition (OWR) equip the nearest class mean (NCM) classification algorithm with a rejection option based on an estimated threshold. While standard approaches [15, 50] use shallow features, in this section we take a step forward, proposing two deep models for open world recognition.

The first model we will discuss builds on recent work by Guerriero et al. [95] and is, to the best of our knowledge, the first deep open world recognition architecture in the literature. This approach couples the flexibility of non-parametric classification methods (Nearest Non-Outlier, NNO [15]), which are necessary to incrementally add new classes over time and can estimate a probability score for each known class supporting the detection of new classes, with the powerful intermediate representations learned by deep networks. We enable end-to-end training of the architecture through an online approximate estimate and update function for the mean prototype representing each known class and for the threshold allowing to detect novel classes in a life-long learning fashion. We name this approach DeepNNO (Deep Nearest Non-Outlier) [167].

The second model improves DeepNNO by forcing the deep architecture used as feature extractor to appropriately cluster samples belonging to the same class, while pushing away samples of other classes. For this reason, it introduces a global clustering loss term that aims at keeping the features of samples belonging to the same class close to their class centroid. Furthermore, we show how the soft nearest neighbor loss [230, 74] can be successfully employed as a local clustering loss term in order to force pairs of samples of the same class to be closer in the learned metric space than relative sample points of other classes. Moreover, differently from DeepNNO and previous shallow works [15], we avoid estimating a global rejection threshold on the model predictions based on heuristic rules; instead we (i) define an independent threshold for each class and (ii) explicitly learn the thresholds by using a margin-based loss function which balances rejection errors on samples of a reserved memory held out from training. We name this approach B-DOC [69].
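To illustrate the class-specific rejection mechanism, here is a schematic sketch: a test feature is assigned to the nearest class centroid only if its distance falls below that class's own threshold, otherwise it is rejected as unknown. The names, the fixed thresholds and the toy data are illustrative placeholders, not the actual B-DOC objective, which learns each threshold with a margin-based loss.

```python
import numpy as np

def classify_with_thresholds(feat, centroids, thresholds):
    """Per-class rejection rule: assign the nearest centroid c only if the
    distance falls below that class's own threshold tau_c, else return 'unk'.
    (Schematic: in B-DOC the tau_c are learned; here they are given.)"""
    dists = np.linalg.norm(centroids - feat, axis=1)
    c = int(np.argmin(dists))
    return c if dists[c] < thresholds[c] else "unk"

# Toy 2-class example in a 2-D feature space (illustrative values)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
taus = np.array([1.0, 1.0])
assert classify_with_thresholds(np.array([0.1, 0.0]), centroids, taus) == 0
assert classify_with_thresholds(np.array([2.5, 2.5]), centroids, taus) == "unk"
```

A per-class threshold lets compact and spread-out classes use different rejection radii, which a single global threshold cannot do.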


7 M. Mancini,H. Karaoguz,E. Ricci,P. Jensfelt,B. Caputo. Knowledge is Never Enough: Towards Web Aided Deep Open World Recognition. IEEE International Conference on Robotics and Automation (ICRA) 2019.

8 D. Fontanel,F. Cermelli,M. Mancini,S. Rota Buló,E. Ricci,B. Caputo. Boosting Deep Open World Recognition by Clustering. IEEE Robotics and Automation Letters 2020.





Figure 3.8. In the open-world scenario a robot must be able to classify correctly known objects, (apple and mug), and detect novel semantic concepts (e.g. banana). When a novel concept is detected, it should learn the new class from an auxiliary dataset, updating its internal knowledge.


We evaluate DeepNNO and B-DOC on the Core50 [151], RGB-D Object Dataset [127] and CIFAR-100 [123] datasets, showing experimentally that DeepNNO outperforms previous OWR methods and that B-DOC shows increased effectiveness in both detecting novel classes and adding new classes to the set of known ones.

The outline of this section is as follows. We start by giving a more formal definition of the OWR problem (Section 3.5.1) and some preliminaries on the NCM [177, 95] and NNO [15, 50] algorithms which serve as starting point for our approaches (Section 3.5.2). We then describe DeepNNO (Section 3.5.3) and B-DOC (Section 3.5.4), showing their results on the aforementioned benchmarks (Section 3.5.5). We conclude by providing a perspective toward autonomous visual systems with preliminary experiments on Web-aided OWR (Section 3.5.6) and the conclusions (Section 3.5.7).

3.5.1 Problem Formulation


The goal of OWR is producing a model capable of (i) recognizing known concepts (i.e. classes seen during training), (ii) detecting unseen categories (i.e. classes not present in any training set used for training the model) and (iii) incrementally adding new classes as new training data become available. Formally, let us denote as $\mathcal{X}$ and $\mathcal{Y}$ the input space (i.e. image space) and the closed world output space (i.e. set of known classes) respectively. Moreover, since our output space will change as we receive new data containing novel concepts, we will denote as $\mathcal{Y}^t$ the set of classes seen after the $t$-th incremental step, with $\mathcal{Y}^0$ denoting the categories present in the first training set. Additionally, since we aim to detect if an image contains an unknown concept, in the following we will denote as unk the special unknown class, building the output space as $\mathcal{Y}^t \cup \{\mathtt{unk}\}$. We assume that, at each incremental step, we have access to a training set $\mathcal{T}^t = \{(x_1^t, c_1^t), \ldots, (x_{N^t}^t, c_{N^t}^t)\}$, with $N^t = |\mathcal{T}^t|$, $x^t \in \mathcal{X}$, and $c^t \in \mathcal{C}^t$, where $\mathcal{C}^t$ is the set of categories contained in the training set $\mathcal{T}^t$. Note that, without loss of generality, in each incremental step we assume to see a new set of classes, i.e. $\mathcal{C}^i \cap \mathcal{C}^j = \emptyset$ if $i \neq j$. The set of known classes at step $t$ is computed as $\mathcal{Y}^t = \cup_{i=0}^{t} \mathcal{C}^i$ and, given a sequence of $S$ incremental steps, our goal is to learn a model mapping input images to either their corresponding label in $\mathcal{Y}^S$ or to the special class unk. In the following we will split the classification model into two components: a feature extractor $f$ that maps the samples into a feature space and a classifier $g$ that maps the features into a class label, i.e. $g(f(x)) = c$ with $c \in \{\mathcal{Y}^S, \mathtt{unk}\}$.
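The label-space bookkeeping above can be summarized in a few lines (the class names are hypothetical):

```python
# At each step t a disjoint set of new classes C^t arrives; the known set Y^t
# is the union of all C^i seen so far, and 'unk' is appended at test time.
steps = [{"apple", "mug"}, {"banana"}]   # C^0, C^1 (disjoint by assumption)
known = set()                            # Y^t, initially empty
for C_t in steps:
    assert known.isdisjoint(C_t)         # C^i ∩ C^j = ∅ for i ≠ j
    known |= C_t                         # Y^t = Y^{t-1} ∪ C^t
output_space = known | {"unk"}           # Y^t ∪ {unk}
assert output_space == {"apple", "mug", "banana", "unk"}
```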

3.5.2 Preliminaries


Standard approaches to tackle the OWR problem apply non-parametric classification algorithms on top of learned metric spaces [15, 50]. A common choice for the classifier $g$ is the Nearest Class Mean (NCM) [177, 95]. NCM works by computing a centroid for each class (i.e. the mean feature vector) and assigning a test sample to the closest centroid in the learned metric space. Formally, we have:

$$g_{\text{NCM}}(x) = \operatorname*{argmin}_{c \in \mathcal{C}^t} d(f(x), \mu_c) \tag{3.13}$$

where $d(\cdot,\cdot)$ is a distance function (e.g. the Euclidean distance) and $\mu_c$ is the mean feature vector of class $c$. The standard NCM formulation cannot be applied in the OWR setting since it lacks the inherent capability of detecting images belonging to unknown categories. To this extent, in [15] the authors extend the NCM algorithm to the OWR setting by defining a rejection criterion for the unknowns. In this extension, called Nearest Non-Outlier (NNO), class scores are defined as:

(3.14) $s_c^{\mathrm{NNO}}(x) = Z\left(1 - \frac{d(f(x), \mu_c)}{\tau}\right),$

where $\tau$ is the rejection threshold and $Z$ is a normalization factor. The final classification is then carried out as:

(3.15) $g(x) = \begin{cases} \mathit{unk} & \text{if } s_c^{\mathrm{NNO}}(x) \leq 0 \;\; \forall c \in \mathcal{Y}^t, \\ g_{\mathrm{NCM}}(x) & \text{otherwise.} \end{cases}$

Following [177], in [15] the features are linearly projected into a metric space defined by a matrix $W$ (i.e. $f(x) = Wx$), with $W$ learned on the first training set $\mathcal{T}^0$ and kept fixed during the successive learning steps. The main limitation of this approach is that new knowledge is incorporated into the classifier $g$ without updating the feature extractor $f$ accordingly. In the next section, we show how the performance of NNO can be significantly improved by using as $f$ a deep architecture trained end-to-end in each incremental step.
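As a concrete illustration, the NCM assignment rule (Eq. 3.13) and the NNO scoring and rejection rules (Eqs. 3.14-3.15) can be sketched in plain Python. This is a minimal sketch with hypothetical helper names and toy 2-D centroids, not the implementation of [15]:

```python
import math

def euclidean(a, b):
    # d(f(x), mu_c): Euclidean distance in the metric space
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ncm_predict(feat, centroids):
    # Eq. 3.13: assign the sample to the class of the nearest centroid
    return min(centroids, key=lambda c: euclidean(feat, centroids[c]))

def nno_predict(feat, centroids, tau, Z=1.0):
    # Eq. 3.14: s_c(x) = Z * (1 - d(f(x), mu_c) / tau)
    scores = {c: Z * (1.0 - euclidean(feat, mu) / tau)
              for c, mu in centroids.items()}
    # Eq. 3.15: reject as "unk" when every class score is non-positive
    if all(s <= 0 for s in scores.values()):
        return "unk"
    return max(scores, key=scores.get)

centroids = {"mug": [0.0, 0.0], "bowl": [4.0, 0.0]}
print(ncm_predict([0.5, 0.2], centroids))             # mug
print(nno_predict([10.0, 10.0], centroids, tau=2.0))  # unk
```

A sample far from every centroid obtains only non-positive scores and is rejected, which is exactly the behavior that the fixed threshold $\tau$ controls.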

3.5.3 Deep Nearest Non-Outlier


The DeepNNO algorithm is obtained from NNO with the following modifications: (i) the feature extractor function is replaced with deep representations derived from neural network layers; (ii) an online update strategy is adopted for the mean vectors μc ; (iii) an appropriate loss is optimized using stochastic gradient descent (SGD) methods in order to compute the feature representations and the associated class specific means.

First, inspired by the recent work [95], we replace the feature extractor function f() with deep representations derived from a neural network fθ() and define the class-specific probability scores as follows:

(3.16) $s_c^{\mathrm{DNNO}}(x) = \exp\left(-\tfrac{1}{2}\,\|f_\theta(x) - \mu_c\|\right).$

Note that, differently from [15], we do not explicitly consider the matrix $W$, since it is replaced by the network parameters $\theta$. Furthermore, we avoid using a clamping function, as this could hamper the gradient flow within the network. This formulation is similar to the NNO version proposed in [50], which has been shown to be more effective than that of [15] in online scenarios.

In OWR the classification model must be updated as new samples arrive. In DeepNNO this translates into incrementally updating the feature representations $f_\theta(x)$ and defining an appropriate strategy for updating the class mean vectors. Given a mini-batch of samples $B = \{(x_1, c_1), \dots, (x_b, c_b)\}$, we compute the mean vectors through:

(3.17) $\mu_c^{t+1} = \frac{n_c\, \mu_c^t + n_{c,B}\, \mu_c^B}{n_c + n_{c,B}}$

where $n_c$ represents the number of samples belonging to class $c$ seen by the network until the current update step $t$, $n_{c,B}$ represents the number of samples belonging to class $c$ in the current batch, and $\mu_c^B$ represents the current mini-batch mean vector relative to the features of class $c$.
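The online centroid update of Eq. (3.17) amounts to a weighted average between the running class mean and the current mini-batch mean. A minimal plain-Python sketch (hypothetical helper name; feature vectors as lists):

```python
def update_class_mean(mu_c, n_c, batch_feats):
    """Eq. 3.17: online update of a class centroid.

    mu_c: current mean vector; n_c: samples of class c seen so far;
    batch_feats: features of the class-c samples in the current mini-batch.
    """
    n_b = len(batch_feats)
    if n_b == 0:
        return mu_c, n_c
    # mini-batch mean of the class-c features
    mu_b = [sum(f[i] for f in batch_feats) / n_b for i in range(len(mu_c))]
    # weighted combination of the running mean and the batch mean
    new_mu = [(n_c * m + n_b * b) / (n_c + n_b) for m, b in zip(mu_c, mu_b)]
    return new_mu, n_c + n_b

mu, n = update_class_mean([0.0, 0.0], 2, [[3.0, 3.0]])
print(mu, n)  # [1.0, 1.0] 3
```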

Given the class-probability scores in DeepNNO we define the following prediction function:

(3.18) $c = \begin{cases} \mathit{unk} & \text{if } s_c^{\mathrm{DNNO}}(x) \leq \Delta \;\; \forall c \in \mathcal{Y}^t \\ \operatorname*{argmax}_{c \in \mathcal{Y}^t} s_c^{\mathrm{DNNO}}(x) & \text{otherwise} \end{cases}$

where $\Delta$ is a threshold which, similarly to the parameter $\tau$ in Eq. (3.14), regulates the number of samples that are assigned to a new class. While in [15] $\tau$ is a user-defined parameter which is kept fixed, in this subsection we argue that a better strategy is to dynamically update $\Delta$, since the feature extractor function and the mean vectors change during training. Intuitively, while training the deep network, an estimate of $\Delta$ can be obtained by looking at the probability score given to the ground-truth class. If the score is higher than the threshold, the value of $\Delta$ can be increased. Conversely, the value of the threshold is decreased if the prediction is rejected. Specifically, given a mini-batch $B$, we update $\Delta$ as follows:
(3.19) $\Delta^{t+1} = \frac{1}{t+1}\left(t\,\Delta^t + \frac{1}{C_B}\sum_{c \in \mathcal{Y}^t} \bar{s}_{c,B}^{\mathrm{DNNO}}\right)$

where $C_B$ is the number of classes in $\mathcal{Y}^t$ represented by at least one sample in $B$ and $\bar{s}_{c,B}^{\mathrm{DNNO}}$ is the weighted average probability score of instances of class $c$ within the batch. Formally, we consider:

(3.20) $\bar{s}_{c,B}^{\mathrm{DNNO}} = \frac{1}{\eta_{B,c}} \sum_{i=1}^{b} w_{c,i}\, s_c^{\mathrm{DNNO}}(x_i)$

where $\eta_{B,c} = \sum_{i=1}^{b} w_{c,i}$ is a normalization factor and:

(3.21) $w_{c,i} = \begin{cases} w^{+} & \text{if } c_i = c \wedge s_c^{\mathrm{DNNO}}(x_i) > \Delta \\ w^{-} & \text{if } c_i = c \wedge s_c^{\mathrm{DNNO}}(x_i) \leq \Delta \\ 0 & \text{otherwise} \end{cases}$

where $w^{-}$ and $w^{+}$ are scalar parameters which allow assigning different importance to samples for which the score given to the ground-truth class is, respectively, rejected or not rejected by the current threshold $\Delta$.
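Eqs. (3.19)-(3.21) can be summarized in a short sketch: per-class weighted average scores (with weights $w^{+}$ and $w^{-}$ from Eq. 3.21) are averaged over the classes present in the batch and blended into the running threshold. The helper below is a hypothetical illustration operating directly on (label, ground-truth score) pairs rather than on network outputs:

```python
def update_delta(delta, t, batch, w_plus=1.0, w_minus=3.0):
    """Sketch of Eqs. 3.19-3.21: running update of the rejection threshold.

    batch: list of (class_label, score_of_ground_truth_class) pairs;
    t: number of threshold updates performed so far.
    """
    per_class = {}
    for c, s in batch:
        # Eq. 3.21: weight the sample by whether the current Delta rejects it
        w = w_plus if s > delta else w_minus
        num, den = per_class.get(c, (0.0, 0.0))
        per_class[c] = (num + w * s, den + w)
    # Eq. 3.20: weighted average score per class, then mean over the
    # classes represented in the batch (the others have zero weight)
    class_avgs = [num / den for num, den in per_class.values()]
    batch_term = sum(class_avgs) / len(class_avgs)
    # Eq. 3.19: running average between the old Delta and the batch estimate
    return (t * delta + batch_term) / (t + 1)

print(update_delta(0.5, t=1, batch=[("mug", 0.8), ("bowl", 0.2)]))
```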

To train the network, we employ standard SGD optimization, minimizing the binary cross entropy loss over the training set:

(3.22) $L = \frac{1}{|\mathcal{T}^t|} \sum_{i} \ell_{CL}(x_i, c_i)$

where:

(3.23) $\ell_{CL}(x_i, c_i) = -\log s_{c_i}^{\mathrm{DNNO}}(x_i) - \sum_{c \in \mathcal{Y}^t} \mathbb{1}_{c \neq c_i} \log\left(1 - s_c^{\mathrm{DNNO}}(x_i)\right)$

After computing the loss, we use standard backpropagation to update the network parameters θ . After updating θ ,we use the samples of the current batch to update both the class mean estimates μc and the threshold Δ ,using Eqn. (3.17) and Eqn.(3.19) respectively.
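A plain-Python sketch of the DeepNNO score (Eq. 3.16) and of the per-sample binary cross-entropy loss (Eq. 3.23) may help fix ideas; the clamping constant `eps` is an assumption added here for numerical stability and is not part of the original formulation:

```python
import math

def deepnno_score(feat, mu):
    # Eq. 3.16: s_c(x) = exp(-0.5 * ||f(x) - mu_c||)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, mu)))
    return math.exp(-0.5 * dist)

def class_loss(feat, label, centroids, eps=1e-7):
    # Eq. 3.23: binary cross-entropy over the class probability scores:
    # -log s_y(x) for the ground-truth class, -log(1 - s_c(x)) for the others
    loss = 0.0
    for c, mu in centroids.items():
        s = min(max(deepnno_score(feat, mu), eps), 1.0 - eps)
        loss += -math.log(s) if c == label else -math.log(1.0 - s)
    return loss
```

A sample lying on its class centroid gets score 1 and a near-zero loss, while a distant sample is penalized, which is what pushes features of the same class together.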

To allow for incremental learning of our deep neural network, we exploit two additional components. Following standard rehearsal-based approaches for incremental learning [216,36,30], the first is a memory which stores the most relevant samples of each class in $\mathcal{Y}^t$. The relevance of a sample $(x, c)$ is determined by its distance $d_c(x)$ to the class mean $\mu_c$, i.e. the lower the distance, the higher the relevance of the sample. The memory is used to augment the training set $\mathcal{T}^{t+1}$, allowing the mean estimates of the classes in $\mathcal{Y}^t$ to be updated while the network is trained on samples of novel ones. In order to avoid unbounded growth, the size of the memory is kept fixed and it is pruned after each incremental step to make room for instances of novel classes. The pruning is performed by removing, for each class in $\mathcal{Y}^t$, the instances with the lowest relevance.
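The relevance-based pruning described above can be sketched as follows; the equal per-class allocation of the memory budget is an assumption of this illustration, and the helper names are hypothetical:

```python
def prune_memory(memory, centroids, capacity):
    """Sketch of rehearsal-memory pruning: keep, for each class, the
    samples closest to the class centroid (highest relevance)."""
    def sq_dist(entry):
        feat, c = entry
        return sum((a - b) ** 2 for a, b in zip(feat, centroids[c]))

    classes = sorted({c for _, c in memory})
    per_class = capacity // max(len(classes), 1)  # equal room per class
    kept = []
    for c in classes:
        # most relevant first: smallest distance to the class centroid
        samples = sorted((e for e in memory if e[1] == c), key=sq_dist)
        kept.extend(samples[:per_class])  # drop the least relevant ones
    return kept

mem = [([0.1, 0.0], "a"), ([5.0, 0.0], "a"), ([1.0, 1.0], "b")]
kept = prune_memory(mem, {"a": [0.0, 0.0], "b": [1.0, 1.0]}, capacity=2)
```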

The second component is a batch sampler which ensures that a given ratio of each batch is composed of samples taken from the memory, independently of the memory size. This avoids biasing the incremental learning procedure towards novel categories in case their number of samples is much larger than the memory size. Additionally, we add a distillation loss [102] which acts as a regularizer and avoids the forgetting of previously learned features. Denoting as $f_{\theta}^{\mathcal{Y}^{t-1}}$ the network trained on the previous set of known classes, the distillation loss is defined as:
(3.24) $\ell_{DS}(x_i) = \left\|f_\theta(x_i) - f_{\theta}^{\mathcal{Y}^{t-1}}(x_i)\right\|$

The overall loss is thus defined as:

(3.25) $L_{\mathrm{DNNO}} = \frac{1}{|\mathcal{T}^t|} \sum_{i} \left(\ell_{CL}(x_i, c_i) + \lambda\, \ell_{DS}(x_i)\right)$

where $\lambda$ is a hyperparameter balancing the contribution of $\ell_{DS}$ within $L_{\mathrm{DNNO}}$.

3.5.4 Boosting Deep Open World Recognition


Despite its experimental effectiveness (see Section 3.5.5), DeepNNO has two main drawbacks. First, the learned feature representation $f$ is not forced to produce predictions clearly localized in a limited region of the metric space. Indeed, constraining the feature representations of a given class to a limited region of the metric space allows both more confident predictions on seen classes and clearer rejections for images of unseen concepts. Second, relying on a heuristic strategy for setting the threshold is sub-optimal, with no guarantees on the robustness of the choice. In the following, we detail how we address both problems in B-DOC.

To obtain feature representations clearly localized in the metric space according to their semantics, we propose to use a pair of losses enforcing clustering. In particular, we use a global term which forces the network to map samples of the same class close to their centroid (Fig. 3.9, left) and a local clustering term which constrains the neighborhood of a sample to be semantically consistent, i.e. to contain samples of the same class (Fig. 3.9, right). In the following we describe the two clustering terms.

Global Clustering. The global clustering term aims to reduce the distance between the features of a sample and the centroid of its class. To model this, we take inspiration from [177] and employ a cross-entropy loss on the probabilities obtained from the distances between samples and class centroids. Formally, given a sample $x$ and its class label $c$, we define the global clustering term as follows:

(3.26) $\ell_{GC}(x, c) = -\log s_c^{\mathrm{BDOC}}(x).$

The class-specific score $s_c^{\mathrm{BDOC}}(x)$ is defined as:

(3.27) $s_c^{\mathrm{BDOC}}(x) = \frac{e^{-\frac{1}{T}\|f_\theta(x) - \mu_c\|^2}}{\sum_{k \in \mathcal{C}^t} e^{-\frac{1}{T}\|f_\theta(x) - \mu_k\|^2}}$

where $T$ is a temperature value which allows us to control the behavior of the classifier. We set $T$ to the variance of the activations in the feature space, $\sigma^2$, in order to normalize the representation space and increase the stability of the system. During training, $\sigma^2$ is the variance of the features extracted from the current batch, while we also keep an online global estimate of $\sigma^2$ that we use at test time. The class mean vectors $\mu_i$ with $i \in \mathcal{Y}^t$, as well as $\sigma^2$, are computed in an online fashion, as in DeepNNO.
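The global clustering term (Eqs. 3.26-3.27) is a cross-entropy over a softmax of negative squared distances scaled by the temperature $T$. A minimal sketch (hypothetical helper name; the max-shifted softmax is an assumption added here for numerical stability):

```python
import math

def global_clustering_loss(feat, label, centroids, T):
    # Eq. 3.27: softmax over negative squared distances to the centroids
    logits = {c: -sum((a - b) ** 2 for a, b in zip(feat, mu)) / T
              for c, mu in centroids.items()}
    m = max(logits.values())  # shift logits for numerical stability
    exps = {c: math.exp(l - m) for c, l in logits.items()}
    prob = exps[label] / sum(exps.values())
    # Eq. 3.26: cross-entropy on the ground-truth class
    return -math.log(prob)
```

A sample lying on its own centroid yields a near-zero loss, while a sample closer to another centroid is penalized, pulling same-class features towards their centroid.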


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_137.jpg?x=277&y=283&w=1093&h=543&r=0

Figure 3.9. Overview of the B-DOC global to local clustering. The global clustering (left) pushes sample representations closer to the centroid (star) of the class they belong to. The local clustering (right), instead, forces the neighborhood of a sample in the representation space to be semantically consistent, pushing away samples of other classes.


Local Clustering. To enforce that the neighborhood of a sample in the feature space is semantically consistent (i.e. given a sample $x$ of a class $c$, the nearest neighbours of $f(x)$ belong to $c$), we employ the soft nearest neighbour loss [230,74]. This loss was proposed to measure the class-conditional entanglement of features in the representation space. In particular, it is defined as:

(3.28) $\ell_{LC}(x, c, B) = -\log \frac{\sum_{x_j \in B_c \setminus \{x\}} e^{-\frac{1}{T}\|f_\theta(x) - f_\theta(x_j)\|^2}}{\sum_{x_k \in B \setminus \{x\}} e^{-\frac{1}{T}\|f_\theta(x) - f_\theta(x_k)\|^2}}$

where $T$ refers to the temperature value, $B$ is the current training batch, and $B_c$ is the set of samples in the training batch belonging to class $c$. Instead of performing multiple learning steps to optimize the value of $T$, as proposed in [74], we set $T = \sigma^2$ as in Eq. (3.27).

Intuitively, given a sample $x$ of a class $c$, a low value of the loss indicates that the nearest neighbours of $f(x)$ belong to $c$, while a high value indicates the opposite (i.e. the nearest neighbours belong to classes $i \in \mathcal{Y}^t$ with $i \neq c$). Minimizing this objective enforces semantic consistency in the neighborhood of a sample in the feature space.
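The soft nearest neighbour term can be seen as the (negative log) ratio between same-class and all-neighbour similarities within the batch. The helper below is a simplified illustration of Eq. (3.28) with hypothetical names:

```python
import math

def soft_nn_loss(x_feat, x_label, batch, T):
    """Sketch of Eq. 3.28: batch holds (features, label) pairs of the
    other samples in the mini-batch (x itself excluded)."""
    def sim(f):
        d2 = sum((a - b) ** 2 for a, b in zip(x_feat, f))
        return math.exp(-d2 / T)

    num = sum(sim(f) for f, c in batch if c == x_label)  # same-class neighbours
    den = sum(sim(f) for f, _ in batch)                  # all neighbours
    return -math.log(num / den)
```

When the nearest neighbours of the sample share its label, the ratio approaches 1 and the loss approaches 0; intruders from other classes inflate the denominator and increase the loss.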

https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_138.jpg?x=433&y=274&w=780&h=525&r=0

Figure 3.10. Overview of how B-DOC learns the class-specific rejection thresholds. The small circles represent the samples in the held out set. The dashed circles, having radius the maximal distance (red), represent the limits beyond which a sample is rejected as a member of that class. As it can be seen, the class-specific threshold is learned to reduce the rejection errors.


Knowledge distillation and full objective. As highlighted in Section 3.5.3, to avoid forgetting old knowledge we want the feature extractor to preserve the behaviour learned in previous learning steps. To this end, as in DeepNNO, we introduce (i) a memory which stores the most relevant samples of the classes in $\mathcal{Y}^t$ and (ii) a distillation loss which enforces consistency between the features extracted by $f$ and those obtained by the feature extractor of the previous learning step, $f^{t-1}$. The distillation loss is computed as in Eq. (3.24). As before, this loss is minimized only in the incremental training steps, hence only when $t > 1$. Additionally, we apply the same balanced batch sampling scheme of DeepNNO.

Overall,given a batch of samples B={(x1,c1),,(xb,cb)} ,we train the network to minimize the following loss:

(3.29) $L_{\mathrm{BDOC}} = \frac{1}{|B|} \sum_{(x, c) \in B} \left[\ell_{GC}(x, c) + \lambda\, \ell_{LC}(x, c, B) + \gamma\, \ell_{DS}(x)\right]$

with λ and γ hyperparameters weighting the different components.

Learning to detect the unknown. In order to extend the NCM-based classifier of B-DOC to the open set scenario, we explicitly learn class-specific rejection criteria. As illustrated in Fig. 3.10, for each class $c$ we define the class-specific threshold as the maximal distance $\Delta_c$ within which a sample is considered to belong to $c$. Under this definition, the B-DOC classifier is:

(3.30) $g(x) = \begin{cases} \mathit{unk} & \text{if } d(f_\theta(x), \mu_c) > \Delta_c \;\; \forall c \in \mathcal{Y}^t, \\ \operatorname*{argmin}_{c} d(f_\theta(x), \mu_c) & \text{otherwise} \end{cases}$

with $d(x, y) = \frac{1}{\sigma^2}\|x - y\|^2$. Instead of heuristically estimating or fixing a maximal distance, we explicitly learn it for each class by freezing the feature extractor $f_\theta$ and minimizing the following objective over the thresholds $\Delta_c$:
(3.31) $\ell_{MD}^{\Delta}(x, c) = \sum_{k \in \mathcal{Y}^t} \max\left(0,\; m \left(\frac{1}{\sigma^2}\|f_\theta(x) - \mu_k\|^2 - \Delta_k\right)\right)$

where $m = 1$ if $c = k$ and $m = -1$ otherwise. The $\ell_{MD}^{\Delta}$ loss leads to an increase of $\Delta_c$ if the distance between a sample belonging to class $c$ and the class centroid $\mu_c$ is greater than $\Delta_c$. Conversely, if a sample not belonging to $c$ has a distance from $\mu_c$ smaller than $\Delta_c$, the loss decreases the value of $\Delta_c$.
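For fixed features, the threshold objective of Eq. (3.31) reduces to a hinge loss per known class. A minimal sketch (hypothetical helper name; distances are assumed precomputed and already normalized by $\sigma^2$):

```python
def threshold_loss(dist2, label, deltas):
    """Sketch of Eq. 3.31: hinge loss over the class thresholds Delta_k.

    dist2: dict class -> normalized squared distance of the sample to mu_k;
    deltas: dict class -> current threshold Delta_k.
    """
    loss = 0.0
    for k, delta_k in deltas.items():
        m = 1.0 if k == label else -1.0  # sign flips for non-matching classes
        loss += max(0.0, m * (dist2[k] - delta_k))
    return loss
```

Gradient descent on this loss grows $\Delta_c$ when a class-$c$ sample falls outside its own radius, and shrinks $\Delta_c$ when an impostor sample falls inside it.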

Overall, the training procedure of B-DOC consists of two steps: in the first, we train the feature extractor on the training set by minimizing Eq. (3.29); in the second, we learn the distances $\Delta_c$ on a set of samples held out from the training set. To this end, we split the samples of the memory into two parts, one used to update the feature extractor $f$ and the centroids $\mu_c$, and the other used to learn the $\Delta_c$ values.

3.5.5 Experimental results


In this subsection, we first introduce the experimental setting and the metrics used for the evaluation, then we report results of DeepNNO and B-DOC, showing ablation studies for each of their components.

Datasets and Baselines. We assess the performance of our models on three datasets: RGB-D Object [127], Core50 [151] and CIFAR-100 [123]. The RGB-D Object dataset [127] is one of the most widely used datasets for evaluating the ability of a model to recognize daily-life objects. It contains 51 different semantic categories, which we split in two parts in our experiments: 26 classes are considered known categories, while the other 25 form the set of unknown classes. Among the 26 classes, we consider the first 11 classes as the initial training set and we incrementally add the remaining classes in 3 steps of 5 classes each. As proposed in [127], we sub-sample the dataset by taking one frame out of every five. For the experiments, we use the first of the original train-test splits defined by the authors [127]. In each split, one object instance from each class is chosen for the test set and removed from the training set. This split provides nearly 35,000 training images and 7,000 test images.

Core50 [151] is a recently introduced benchmark for testing continual learning methods in an egocentric setting. The dataset contains images of 50 objects grouped into 10 semantic categories. The images were acquired in 11 different sequences under varying conditions. Following the standard protocol described in [151], we select sequences 3, 7 and 10 for the evaluation phase and use the remaining ones to train the model. Due to these differences in conditions between the sequences, Core50 represents a very challenging benchmark for object recognition. As for the RGB-D Object dataset, we split it into two parts: 5 categories are considered known and the other 5 unknown. In the known set, the first 2 categories are considered as the initial training set. The others are incrementally added one class at a time.

CIFAR-100 [123] is a standard benchmark for comparing incremental class learning algorithms [216]. It contains 100 different semantic categories. We split the dataset into 50 known and 50 unknown classes, considering 20 classes as the initial training set. Then, we incrementally add the remaining ones in steps of 10 classes. We evaluate the performance of DeepNNO and B-DOC in the OWR scenario, comparing them with NNO [15], using the simplified implementation of [50]. We further compare our methods with two standard incremental class learning algorithms, namely LwF [144] (in the MC variant of [216]) and iCaRL [216]. Both LwF and iCaRL are designed for the closed world scenario, thus we use their performances as a reference in that setting, without open-ended evaluation. For each dataset, we randomly chose five different sets of known and unknown classes. After fixing them, we ran the experiments three times for each method. The results are obtained by averaging over runs and class orders.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_140.jpg?x=283&y=294&w=1088&h=920&r=0

Figure 3.11. Comparison of NNO [15], DeepNNO and B-DOC on RGB-D Object dataset [127]. The numbers in parenthesis denote the average accuracy among the different incremental steps.


Network architectures and training protocols. We use a ResNet-18 architecture [98] for all the experiments. For the RGB-D Object dataset and Core50, we train the network from scratch on the initial classes for 12 epochs and for 4 epochs in the incremental steps. For CIFAR-100, instead, we set the epochs to 120 for the initial learning stage and to 40 for each incremental step. In the case of NNO, we use the features extracted from the pre-trained network to compute the class-specific mean vectors of novel categories, but we do not update the weight matrix $W$ and the threshold parameter $\tau$, as in [15]. For DeepNNO we use an initial learning rate of 1.0 in all settings; for B-DOC we use a learning rate of 0.1 for the RGB-D Object dataset and CIFAR-100, and 0.01 for Core50, with a batch size of 128 for the RGB-D Object dataset and of 64 for CIFAR-100 and Core50. We train the networks using SGD with momentum 0.9 and a weight decay of $10^{-3}$ on all datasets. We resize the images of the RGB-D Object dataset to $64 \times 64$ pixels, those of CIFAR-100 to $32 \times 32$ and those of Core50 to $128 \times 128$ pixels. We perform random cropping and mirroring on all the datasets. In all experiments, we set $\lambda = 1$, $w^{+} = 1$ and $w^{-} = 3$ for DeepNNO, while $\lambda = \gamma = 1$ for B-DOC. For both methods we consider a fixed-size memory of 2000 samples, constructing each batch by drawing 40% of the instances from the memory. Note that, in B-DOC, 20% of the samples present in the memory are never seen during training, but are used only to learn the class-specific threshold values $\Delta_c$. For this set of held-out samples, we also perform color jittering, varying brightness, hue and saturation.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_141.jpg?x=283&y=293&w=1091&h=921&r=0

Figure 3.12. Comparison of NNO [15], DeepNNO and B-DOC on Core50 [151]. The numbers in parenthesis denote the average accuracy among the different incremental steps.


Metrics. We use three standard metrics for comparing the performance of OWR methods. For the closed world we report the global accuracy with and without the rejection option. Specifically, in the closed world without rejection setting, the model is tested only on the known set of classes, excluding the possibility of classifying a sample as unknown. This scenario measures the ability of the model to correctly classify samples among the given set of classes. In the closed world with rejection scenario, instead, the model can either classify a sample among the known set of classes or categorize it as unknown. This scenario is more challenging than the previous one because samples belonging to the set of known classes might be misclassified as unknowns. For the open world we use the standard OWR metric defined in [15] as the average between the accuracy computed in the closed world with rejection scenario and the accuracy computed in the open set scenario (i.e. the accuracy in rejecting samples of unknown classes). Since the latter metric biases the final score (e.g. a method rejecting every sample will achieve a 50% accuracy), we introduce OWR-H, the harmonic mean between the open set accuracy and the closed world with rejection accuracy, to mitigate this bias.
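Both open world scores can be computed from the closed-world-with-rejection accuracy and the open set accuracy; the reject-everything degenerate case illustrates why the harmonic OWR-H is introduced. A minimal sketch (hypothetical helper name; accuracies in [0, 1]):

```python
def owr_metrics(closed_rej_acc, open_set_acc):
    # standard OWR: arithmetic mean, as defined in [15]
    owr = (closed_rej_acc + open_set_acc) / 2.0
    # OWR-H: harmonic mean, penalizing degenerate reject-everything behavior
    total = closed_rej_acc + open_set_acc
    owr_h = 0.0 if total == 0 else 2.0 * closed_rej_acc * open_set_acc / total
    return owr, owr_h

# A model rejecting every sample: perfect open set accuracy, zero closed
# world accuracy. The arithmetic mean still scores 0.5, OWR-H scores 0.
print(owr_metrics(0.0, 1.0))  # (0.5, 0.0)
```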


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_142.jpg?x=286&y=296&w=1077&h=448&r=0

Figure 3.13. Comparison of NNO [15], DeepNNO and B-DOC on CIFAR-100 dataset [123]. The numbers in parenthesis denote the average accuracy among the different steps.


Quantitative results


We report the results on the RGB-D Object dataset in Fig. 3.11. Consider first the closed world without rejection setting, reported in Fig. 3.11a. This scenario is used to assess the ability of a method to learn novel classes while preserving old knowledge, without considering the open-set scenario. As a first observation, we note that both our deep methods outperform NNO by a large margin (i.e. by 9.2% for DeepNNO and 14.8% for B-DOC in average accuracy), showing the importance of end-to-end trained deep representations for OWR. Remarkably, B-DOC outperforms DeepNNO by 5.6% of accuracy on average. The reason for the improvement is the introduction of the global and local clustering loss terms, which allow the model to better aggregate samples of the same class and to better separate them from samples of other classes. Comparing our models with the incremental class learning approaches LwF and iCaRL, we can see that both of them are highly competitive, surpassing LwF by a large gap while being either comparable (B-DOC) or slightly inferior (DeepNNO) to the more effective iCaRL. We believe these are remarkable results, given that the main goal of our models is not purely to extend their knowledge over time with new concepts.
Concerning the comparison in the closed world with rejection setting, shown in Fig. 3.11b, again DeepNNO and B-DOC surpass NNO in terms of performance. The results of B-DOC are remarkable, demonstrating that it achieves higher confidence on the known classes and rejects a lower number of known samples. In particular, B-DOC is more confident in the first incremental steps and obtains, on average, an accuracy 10.3% higher than DeepNNO.

The findings are confirmed also by the OWR metrics. Again, both DeepNNO and B-DOC surpass NNO, showing the importance of end-to-end trained representations and updated thresholds in achieving a higher performance, even in the presence of unknowns. Also on the OWR metrics, B-DOC surpasses DeepNNO. From the OWR results, reported in Fig. 3.11c, we see that B-DOC reaches performance similar to DeepNNO in the first steps, while it outperforms it in the later ones. Considering the OWR-H (Fig. 3.11d), B-DOC is better in all the incremental steps. This is because its learned rejection criterion, coupled with the clustering losses, allows B-DOC to achieve a better trade-off between the open set accuracy and the closed world with rejection accuracy. Overall, B-DOC improves on average by 4.8% and 5.2% with respect to DeepNNO in the OWR and OWR-H metrics respectively. We provide a deeper analysis of the rejection criteria of DeepNNO and B-DOC with ablation studies in the next subsections.

In Fig. 3.12 we report the results on the Core50 [151] dataset. Similarly to the RGB-D Object dataset, DeepNNO and B-DOC achieve very competitive results with respect to incremental class learning algorithms designed for the closed world scenario, with B-DOC remarkably outperforming iCaRL by 4.7% of accuracy in the last incremental step. Similarly, B-DOC achieves superior performance in the closed world, both without and with the rejection option, with respect to the other OWR algorithms, outperforming NNO by 13.01% and DeepNNO by 7.74% on average in the former (Fig. 3.12a) and by more than 10% for both NNO and DeepNNO in the latter (Fig. 3.12b). In particular, it is worth noting how the challenges of Core50 (i.e. train and test acquisitions under different conditions) do not allow DeepNNO and NNO to properly model the confidence threshold, causing them to reject most of the samples of the known classes. Indeed, with the rejection option included, the accuracy drops to 27.2% and 26.3% for DeepNNO and NNO respectively, while B-DOC reaches an average accuracy of 38.0%.

In Fig. 3.12c and Fig. 3.12d, we report the OWR performances (standard and harmonic) on Core50. While DeepNNO surpasses the performance of NNO in both metrics (by 5.4% in standard OWR and 3.1% in OWR-H), B-DOC performs even better, outperforming DeepNNO by 3.4% and 7.2% on average in the standard OWR and OWR-H metrics respectively, confirming the effectiveness of the clustering losses and the learned class-specific maximal distances.

Finally, in Fig. 3.13 we report the results on the CIFAR-100 dataset in terms of the OWR (Fig. 3.13a) and OWR-H metrics (Fig. 3.13b). Even in this benchmark, confirms the finding of previous analysis: end-to-end trained methods with updated thresholds (DeepNNO and B-DOC ) are more effective than shallow methods (NNO). Similarly to previous analyses, B-DOC outperforms, on average both DeepNNO and NNO, with lower performances only in the initial training stage. However, in the incremental learning steps B-DOC clearly outperforms both methods, demonstrating its ability to learning and recognizing in an open-world without forgetting old classes. The relative gaps are still remarkable. DeepNNO improves over NNO by 6.3% and 4% in OWR and OWR-H metrics respectively. However,in the incremental steps, the average improvement of B-DOC over NNO are of 10% in both OWR and OWR-H metrics,while over DeepNNO are of 2% for the OWR and 4.5% for the OWR-H metric.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_144.jpg?x=268&y=269&w=544&h=466&r=0

Figure 3.14. CIFAR-100 results in the closed world scenario.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_144.jpg?x=841&y=260&w=533&h=473&r=0

Figure 3.15. CIFAR-100: open world performances varying the number of known and unknown classes.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_144.jpg?x=269&y=877&w=541&h=438&r=0

Figure 3.16. CIFAR-100 results of DeepNNO in the closed world scenario for different values of w−.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_144.jpg?x=837&y=879&w=540&h=436&r=0

Figure 3.17. CIFAR-100 results of DeepNNO in the closed world scenario for different values of λ .


Ablation study of DeepNNO


DeepNNO improves over NNO by introducing two main aspects: end-to-end trained deep representations with an updated rejection threshold, and a distillation loss to preserve old knowledge. In the following we analyze in detail the reasons behind the improvement of DeepNNO with respect to NNO on the CIFAR-100 dataset, focusing first on the importance of learning deep representations with updated thresholds and then on the impact of the distillation loss.
Deep representation and updated threshold. We start by performing experiments in the closed world scenario, i.e. measuring the performances considering only the set of known classes. In particular, we compare the performance of DeepNNO with NNO and with DeepNNO without the rejection option (i.e. DeepNNO-no rejection). The latter baseline is the upper bound of DeepNNO in terms of closed world performance, since it does not reject any instance of known classes (i.e. it never identifies samples of known classes as unknowns). This baseline is used to demonstrate the validity of the method in Eq. (3.19) for setting the threshold Δ. The results are shown in Fig. 3.14, where the numbers between parentheses denote the average accuracy among the different incremental steps. From Fig. 3.14 it is possible to draw two observations. First, there is a large gap between the performances of DeepNNO and NNO, with our model outperforming its non-deep counterpart by more than 16% on average and by more than 20% after all the incremental steps.

The improved performance of DeepNNO can be ascribed to the fact that, by dynamically updating the learned feature representations, DeepNNO is able to better adapt the learned classifier to novel semantic concepts. Second, DeepNNO achieves results close to DeepNNO without rejection. This indicates that, thanks to the proposed approach for setting the threshold Δ, DeepNNO only rarely identifies samples of known classes as belonging to an unknown category. We believe this is mainly due to the introduction of the different weighting factors w− and w+ while updating Δ. This observation is confirmed by the results shown in Fig. 3.16, which analyzes the effect of varying w− with w+ fixed to 1. As w− decreases, the accuracy decreases as well, due to the higher value reached by Δ, which leads to wrongly rejecting many samples of known classes as instances of unknown ones. We want to highlight, however, that in more complex and realistic scenarios the threshold obtained by DeepNNO does not generalize as well, and the more principled strategy of B-DOC proves more effective, as we will show in the next subsection.
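While Eq. (3.19) is not reproduced here, the role of the two weights can be illustrated with a toy weighted-mean threshold. The sketch below is purely illustrative and not the exact update rule of DeepNNO: it only shows that down-weighting the term associated with the (lower-confidence) misclassified samples, i.e. a smaller w−, pushes the resulting threshold up, mirroring the over-rejection trend of Fig. 3.16.

```python
def weighted_threshold(conf_pos, conf_neg, w_plus=1.0, w_minus=1.0):
    """Illustrative weighted-mean confidence threshold.

    conf_pos: confidences of correctly classified training samples.
    conf_neg: confidences of misclassified ones (typically lower).
    A smaller w_minus biases the mean towards the high-confidence
    samples, raising the threshold."""
    num = w_plus * sum(conf_pos) + w_minus * sum(conf_neg)
    den = w_plus * len(conf_pos) + w_minus * len(conf_neg)
    return num / den

high = [0.9, 0.85, 0.95]  # toy confidences of correct predictions
low = [0.4, 0.5]          # toy confidences of wrong predictions

balanced = weighted_threshold(high, low, w_plus=1.0, w_minus=1.0)
skewed = weighted_threshold(high, low, w_plus=1.0, w_minus=0.1)
assert skewed > balanced  # smaller w_minus -> higher threshold
```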

As a second experiment, we compare the performances of DeepNNO and NNO in the open world recognition scenario varying the number of known and unknown classes. The results are shown in Fig. 3.15, from which it is easy to see that DeepNNO outperforms its non-deep counterpart by a large margin. In fact, in this scenario, our model achieves a standard OWR accuracy 9% higher than standard NNO on average, considering 50 unknown classes. Moreover, this margin increases during the training: after all the incremental steps our model outperforms NNO by a margin close to 15%. It is worth noting that the advantages of our model are independent of the number of unknown classes, since DeepNNO consistently outperforms NNO in all settings.

Distillation loss. Another important component of DeepNNO is the distillation loss. This loss guarantees the right balance between learning novel concepts and preserving old features. To analyze its impact, in Fig. 3.17 we report the performances of DeepNNO in the closed world scenario for different values of λ. From the figure it is clear that, without the regularization effect of the distillation loss, the accuracy significantly drops. On the other hand, a high value of λ leads to poor performance and low confidence on the novel categories. The best performance is achieved by properly balancing the contributions of the classification and distillation losses. The use of the distillation loss is thus crucial for limiting catastrophic forgetting, as previously verified in [144, 216].
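The combination of the two terms follows the standard distillation scheme popularized by [144, 216]: cross-entropy on the new labels plus λ times a cross-entropy towards the old model's softened outputs. The numpy sketch below is a generic instance of this scheme; the temperature and the exact coefficients used in the thesis are assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(logits_new, label, logits_old, lam, T=2.0):
    """Classification loss on the new task plus a distillation term
    that keeps the new model close to the old model's soft targets."""
    ce = -np.log(softmax(logits_new)[label])
    p_old = softmax(logits_old, T)            # soft targets of old model
    p_new = softmax(logits_new, T)
    distill = -np.sum(p_old * np.log(p_new))  # cross-entropy to soft targets
    return ce + lam * distill

args = ([2.0, 0.5, -1.0], 0, [1.5, 0.8, -0.5])
ce_only = total_loss(*args, lam=0.0)       # lambda = 0: no regularization
with_distill = total_loss(*args, lam=1.0)  # adds a non-negative penalty
assert with_distill > ce_only
```

In the terms of Fig. 3.17, λ = 0 removes the regularization (stronger forgetting), while a too-large λ makes the distillation term dominate, hindering the learning of novel categories.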

Method  | Known classes: 11 | 16   | 21   | 26   | Avg OWR | Avg OWR-H
GC      | 66.0              | 57.3 | 58.6 | 53.3 | 58.8    | 58.7
LC      | 64.1              | 56.0 | 57.9 | 56.4 | 58.6    | 58.4
Triplet | 62.1              | 54.9 | 54.8 | 49.5 | 55.4    | 55.4
GC+LC   | 67.7              | 59.6 | 59.5 | 57.3 | 61.0    | 60.8

Table 3.12. Ablation study of B-DOC on the global clustering (GC), local clustering (LC) and triplet loss terms, on the OWR metric. The right column shows the average OWR-H over all steps.

Method  | Class specific | Multi stage | Known | Unknown | Diff.
DeepNNO |                |             | 84.4  | 98.8    | 14.4
B-DOC   | ✓              |             | 83.0  | 98.6    | 15.6
B-DOC   |                | ✓           | 4.4   | 26.9    | 22.6
B-DOC   | ✓              | ✓           | 27.4  | 65.2    | 37.8

Table 3.13. Rejection rates of different techniques for detecting the unknowns. The results are computed using the same feature extractor on the RGB-D Object dataset.


Ablation study of B-DOC


B-DOC is mainly built on three components: the global clustering loss (GC), the local clustering loss (LC) and the learned class-specific rejection thresholds. In the following we analyze the contribution of each of them. We start from the two clustering losses, and then compare our choice of rejection criterion with other common choices.

Global and local clustering. In Table 3.12 we compare the two clustering terms considering the open world recognition metrics in the RGB-D Object dataset. By analyzing the two loss terms separately we see that, on average, they show similar performance. In particular, using only the global clustering (GC) term we achieve slightly better performance on the first three incremental steps, while on the fourth the local clustering (LC) term is better. However, the best performance on every step is achieved by combining the global and local clustering terms (GC + LC). This demonstrates that the two losses provide different contributions, being complementary to learn a representation space which properly clusters samples of the same classes while better detecting unknowns.

Lastly, since the B-DOC loss functions and the triplet loss [11] share the same objective, i.e. building a metric space where samples sharing the same semantics are closer than ones with different semantics, we also report in Table 3.12 the results achieved by replacing our loss terms with a triplet loss [11]. As the table shows, the triplet loss formulation (Triplet) fails to reach competitive results with respect to our full objective function in Eq. (3.29), with a gap of more than 5% in both the standard OWR metric and the OWR harmonic mean. Notably, it also achieves lower results with respect to each of the loss terms in isolation, and the superior performance of LC confirms the advantages of SNNL-based loss functions with respect to triplets, as shown in [74].
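The exact GC and LC formulations are part of Eq. (3.29) and are not repeated here; as a reference point for the SNNL family of losses they belong to [74], a minimal soft nearest neighbour loss over a batch of features can be sketched as follows (an illustrative implementation, not the one used in the thesis):

```python
import numpy as np

def snnl(feats, labels, T=1.0):
    """Soft nearest neighbour loss [74]: for each sample, the negative log
    probability that its soft nearest neighbour shares its label. Lower
    values mean same-class samples are comparatively closer in the batch."""
    feats, labels = np.asarray(feats, float), np.asarray(labels)
    loss = 0.0
    for i in range(len(feats)):
        d = np.sum((feats - feats[i]) ** 2, axis=1)
        w = np.exp(-d / T)
        w[i] = 0.0  # exclude the sample itself
        loss += -np.log(w[labels == labels[i]].sum() / w.sum())
    return loss / len(feats)

# two tight same-class clusters give a near-zero loss; shuffling the
# labels across clusters makes the loss large
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
clustered = snnl(X, [0, 0, 1, 1])
shuffled = snnl(X, [0, 1, 0, 1])
assert clustered < shuffled
```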
Detecting the Unknowns. In Table 3.13 we report a comparison of different strategies to reject samples on the RGB-D Object dataset [127]. In particular, using the same feature extractor, we compare the proposed method to learn the class-specific maximal distances (i.e. Eq. (3.31)) with three baselines: (i) the online update strategy of DeepNNO (Eq. (3.19)), (ii) we learn class-specific maximal distances but during training (i.e. without our two-stage pipeline) and (iii) we learn a single maximal distance which applies to all classes using our two-stage training strategy.

The comparison is performed considering the difference between the rejection rates on the known and unknown samples. For the known class samples, we report the percentage of correctly classified samples in the closed world that are rejected when the rejection option is included. We intentionally remove the wrongly classified samples since we want to isolate rejection mistakes from classification ones. On the unknown samples, we report the open-set accuracy, i.e. the percentage of rejected samples among all the unknown ones. In the third column, we report the difference between the open-set accuracy and the rejection rate on known samples. Ideally, the difference should be as close as possible to 100%, since we want a 100% rejection rate on unknown class samples and 0% on the known class ones.
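As a small worked check of this criterion (plain arithmetic, not code from the thesis), the value in the third column of Table 3.13 is simply the open-set accuracy minus the known-class rejection rate:

```python
def rejection_gap(known_rejection_rate, open_set_accuracy):
    """Criterion of Table 3.13: share of unknown samples rejected minus
    share of correctly classified known samples rejected (ideal: 100)."""
    return open_set_accuracy - known_rejection_rate

# values from Table 3.13 (RGB-D Object dataset)
deepnno = rejection_gap(84.4, 98.8)  # DeepNNO threshold
full = rejection_gap(27.4, 65.2)     # class-specific + two-stage
assert full > deepnno
```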

From the table, we can see that the highest gap is achieved by the class-specific maximal distances with the two-stage pipeline we proposed, which reject 27.4% of known class samples and 65.2% of the unknown ones. The gap with the other strategies is remarkable. Using the two-stage pipeline but a class-generic maximal distance leads to a low rejection rate on both known and unknown samples, achieving a difference of 22.6%, which is 15.2% less than using class-specific distances. On the other hand, estimating the confidence threshold as proposed in DeepNNO, or without our two-stage pipeline, produces a very high rejection rate on both known and unknown classes, leading to differences of 14.4% and 15.6% for DeepNNO and the single-stage strategy respectively, the two lowest among the four strategies. In fact, computing the thresholds using only the training set biases the rejection criterion towards the overconfidence that the method has acquired on this set. Consequently, the model treats the different test data distribution (caused e.g. by different object instances) as a reason for rejection even if the actual concept present in the input is known. Using the two-stage process we can overcome this bias, tuning the rejection criterion on unseen data on which the model cannot be overconfident.
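The rejection rule shared by these strategies can be sketched as a nearest-class-mean classifier with per-class maximal distances: a sample is assigned to its closest centroid, and rejected as unknown if the distance exceeds that class's threshold. The centroids and thresholds below are toy values; in B-DOC the class-specific distances are learned in the second stage via Eq. (3.31).

```python
import numpy as np

def ncm_with_rejection(x, centroids, max_dist):
    """Assign x to the nearest class mean; reject as 'unknown' when the
    distance exceeds the maximal distance associated with that class."""
    x = np.asarray(x, dtype=float)
    dists = {c: float(np.linalg.norm(x - np.asarray(mu, dtype=float)))
             for c, mu in centroids.items()}
    c = min(dists, key=dists.get)
    return c if dists[c] <= max_dist[c] else "unknown"

centroids = {"mug": [0.0, 0.0], "bowl": [4.0, 4.0]}  # toy class means
max_dist = {"mug": 1.0, "bowl": 1.5}                 # toy per-class thresholds

assert ncm_with_rejection([0.2, 0.1], centroids, max_dist) == "mug"
assert ncm_with_rejection([10.0, 10.0], centroids, max_dist) == "unknown"
```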

3.5.6 Towards Autonomous Visual Systems: Web-aided OWR


The OWR frameworks considered so far assume the existence of an 'oracle' providing annotated images for each new class. In a robotic scenario, this has often been translated into having a human in the loop, with the robot asking for images and labels. This scenario somehow limits the autonomy of a robot system, which, without the presence of a teacher, would find itself stuck when detecting a new object. Moreover, especially in robotics applications, this assumption is highly unrealistic since: i) the labels of samples of unknown categories are, by definition, unknown; ii) images of the unknown classes for incrementally updating the model are usually unavailable, since it is impossible to have a pre-loaded database containing all possible classes existing in the real world. In the last part of this section we describe a simple general pipeline to address the aforementioned issues, with first pilot experiments showing its possible application.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_148.jpg?x=268&y=257&w=1089&h=381&r=0

Figure 3.18. Overview of the open world recognition task within a robotic platform. Given an image of an object, a classification algorithm assigns a class label to it. If the object is recognized as novel, the object label and related images are obtained through external resources (e.g. a human and/or the Web). Finally, the images are used to incrementally update the knowledge base of the robot.


We start by considering the problem of retrieving the correct label of an unknown object. To this end, we exploit standard search tools used by humans. First, once an object is recognized as unknown, we query the Google Image Search engine 9 to retrieve the keyword closest to the current image. Obviously the retrieved label might not be correct, e.g. due to the low resolution of the image or a non-canonical pose of the object. We tackle this issue through an additional human verification step, leaving the investigation of this problem to future works. As a subsequent step, we use the retrieved keyword to automatically download images from the Web. These weakly-annotated and noisy images represent new training data for the novel category, which can be used to incrementally train the deep network. Fig. 3.18 shows an overview of our pipeline. Interestingly, this simple framework mimics the human ability to learn not only from situated experiences, but also from visual knowledge externalized on artifacts (e.g. drawings), or indeed Web resources.
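The loop of Fig. 3.18 can be summarized by the skeleton below. All four injected callables (`query_image_search`, `human_verify`, `download_images`, `incremental_update`) and the toy model are hypothetical placeholders standing in for the image search engine, the human check, the Web download and the incremental training step; none of them are real APIs.

```python
def web_aided_update(image, model, query_image_search, human_verify,
                     download_images, incremental_update):
    """Skeleton of the Web-aided OWR loop of Fig. 3.18. The four
    callables are hypothetical placeholders, injected as dependencies."""
    label = model.predict(image)
    if label != "unknown":
        return label                       # known concept: nothing to do
    keyword = query_image_search(image)    # closest keyword from image search
    keyword = human_verify(keyword)        # human corrects wrong keywords
    web_images = download_images(keyword)  # weakly-annotated training data
    incremental_update(model, keyword, web_images + [image])
    return keyword


class ToyModel:
    """Stand-in classifier: knows a set of labels, everything else is unknown."""
    def __init__(self):
        self.known = {"mug"}
    def predict(self, image):
        return image if image in self.known else "unknown"


model = ToyModel()
label = web_aided_update(
    "hammer", model,
    query_image_search=lambda img: "hammer",  # search returns a keyword
    human_verify=lambda kw: kw,               # human accepts the keyword
    download_images=lambda kw: [f"web_{kw}_{i}" for i in range(3)],
    incremental_update=lambda m, kw, imgs: m.known.add(kw),
)
```

Passing the external resources as callables keeps the loop agnostic to whether the verification step is a human or, in future work, an automatic filter.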

We conduct a first series of preliminary experiments, using Web images in the incremental learning steps of DeepNNO, to validate the feasibility of this pipeline. The results of our experiments are shown in Fig. 3.19 for CIFAR-100 and in Fig. 3.20 for Core50. As expected, considering images from the Web instead of images from the datasets leads to a decrease in performance. However, the accuracy of the Web-based DeepNNO is still good, especially when compared with its non-deep counterpart.

On the CIFAR-100 experiments we achieve a remarkable performance, with Web DeepNNO outperforming NNO by 3.5% on average and by more than 5% after all the incremental steps, with respect to the standard OWR metric. We highlight that these results have been achieved exploiting only noisy and weakly labeled Web images, without any filtering procedure or additional optimization constraints. On the Core50 experiments, the gap between DeepNNO and NNO is lower, as shown in Fig. 3.12c and 3.12d, and this also impacts the results of the Web-based version of DeepNNO, which achieves a modest improvement with respect to NNO. We ascribe this behavior to the large appearance gap between Core50 images, gathered in an egocentric setting, and Web images: both the rejection threshold and the semantic centroids of new classes are thus unable to properly model the underlying data distribution, deteriorating the final results. We believe that this issue can be addressed in future works, e.g. by imposing some constraints on the quality of downloaded images and by coupling DeepNNO with domain adaptation techniques [203,28,166,169] in order to reduce the domain shift between downloaded images and training data.


9 https://images.google.com/




https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_149.jpg?x=272&y=260&w=534&h=432&r=0

Figure 3.19. CIFAR-100: performances of Web-aided OWR in the open world scenario, with 50 unknown classes.


https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_149.jpg?x=837&y=259&w=540&h=436&r=0

Figure 3.20. Core50 dataset: performances of Web-aided OWR in the open world scenario, with 5 unknown classes.


To validate the applicability of the pipeline in a real scenario, we tested the Web-aided version of DeepNNO by integrating it into a visual object detection framework and running it on a Yumi 2-arm manipulator equipped with a Kinect. We have used the Faster-RCNN framework in [219] with the ResNet-101 architecture [99] as backbone. We pre-trained the network on the COCO dataset [147], after replacing the standard fully-connected classifier with the proposed DeepNNO. We performed an open world detection experiment by placing multiple objects (known and unknowns) in the workspace of the robot. Whenever a novel object is detected, the robot tries to get the corresponding label from Google Image Search, using the cropped image of the unknown object. In case the label is not correct, a human operator cooperates with the robot and provides the right label. The provided label is used by the robot to automatically download the images associated to the novel class from the Web sources. These images and the original one where the object has been detected in the workspace, are then used to update the classification model.

Figure 3.21 shows a qualitative result associated to our experiment. The robot was able to correctly detect the red hammer as unknown, add it in its knowledge base and recognize it in subsequent learning steps. 10 Despite the simplicity of the workspace, we want to highlight that the robot was able to recognize the hammer without any explicitly labeled training data for the class of interest.


10 A full example is available in the supplementary material of [167].




https://cdn.noedgeai.com/0195d707-35c1-78bd-a59f-97733327d1b9_150.jpg?x=267&y=255&w=1117&h=593&r=0

Figure 3.21. Qualitative results of deployment of DeepNNO on a robotic platform. The robot recognizes an object as unknown (i.e. the red hammer, bottom) and adds it to the knowledge base through the incremental learning procedure (top right).


We want to point out that we are not claiming here that our framework incorporates new knowledge into a visual robotic system in a completely autonomous and fully effective way. Indeed, (i) the human verification step on the retrieved keyword is necessary, and (ii) Web supervision [57, 41] requires addressing challenges such as noisy labels [193] and domain shift [287], which we did not take into account. Nevertheless, we believe our experiments show that our pipeline is a feasible starting point, worth exploring in future research directions toward autonomous learners in the real world.

3.5.7 Conclusions


In this section, we presented two approaches to tackle the open world recognition problem in robot vision. We base our approaches on an NCM classifier built on top of end-to-end trainable deep features (DeepNNO), and we further boost the OWR performances of this framework by training the deep architecture to minimize a global to local semantic clustering loss (B-DOC) which allows reducing distances of samples of the same class in the feature space while separating them from points belonging to other classes, better detecting unknown concepts. In B-DOC we also avoid heuristic estimates of a rejection criterion for detecting unknowns by explicitly learning class-specific distances beyond which a sample is rejected. Quantitative and qualitative analysis on standard recognition benchmarks shows the efficacy of the proposed approaches and choices, outperforming the previous state-of-the-art OWR algorithm. Finally, we also showed preliminary experiments with a simple pipeline for allowing the robot to autonomously learn new semantic concepts, without the aid of an oracle providing it with a training set containing the desired target classes.
Future works will further investigate webly supervised approaches with the goal of pushing the envelope in life-long learning of autonomous systems. In particular, when training images are autonomously retrieved from the Web, they come with inherent label noise (e.g. wrong semantics) and domain shift (e.g. white backgrounds). Attacking all these problems would allow active visual systems to get closer to full autonomy. In an intermediate direction, it would be interesting to analyze the OWR problem in an active learning context [199], letting the robot decide when to ask for human help for either collecting data or labeling new concepts.

This section concludes our line of works on incrementally injecting new knowledge into a pre-trained deep model under various scenarios, with (ICL) or without (multi-domain learning) output spaces shared with old knowledge, and with (ICL, multi-domain learning) or without (OWR) the closed-world assumption. Additionally, we identified problems (e.g. the semantic shift of the background class) and posed challenges (Web-aided OWR) not yet tackled by the community. Nevertheless, differently from Chapter 2, here we considered the training and test distributions to be equal, without any domain shift problem. On the other hand, differently from the techniques presented in Chapter 2, this chapter described techniques that allow modifying the output space of a pre-trained architecture. In the next chapter, we will merge these two worlds together, describing the first method capable of recognizing unseen semantic concepts in unseen visual domains.

Chapter 4 Towards Recognizing Unseen Categories in Unseen Domains


While in the previous chapters we considered methods extending a pretrained model either to new input distributions or to new semantic concepts, an open research question is whether we can address the two problems together, producing a deep model able to recognize new semantic concepts (i.e. addressing the semantic shift) in possibly unseen domains (i.e. addressing the domain shift). In this chapter, we start analyzing how we can merge these two worlds, providing a first attempt in this direction in an offline but quite extreme setting. In particular, we consider a scenario where, during training, we are given a set of images of multiple domains and semantic categories, and our goal is to build a model able to recognize images of unseen concepts, as in zero-shot learning (ZSL), in unseen domains, as in domain generalization (DG). This novel problem, which we call ZSL under DG (ZSL+DG), poses novel research questions going beyond the ones posed by the DG and ZSL problems taken in isolation. For instance, similarly to DG, we can rely on the fact that the multiple source domains permit disentangling semantic and domain-specific information. However, differently from DG, we have no guarantee that the disentanglement will hold for the unseen semantic categories at test time. Moreover, while in ZSL it is reasonable to assume that the learned mapping between images and semantic attributes will generalize also to test images of the unseen concepts, in ZSL+DG we have no guarantee that this will happen for images of unseen domains. In Section 4.1 we provide a formal definition of the problem, while in Sec. 4.2 we review the related works in the zero-shot learning and domain generalization literature. In Section 4.3 we provide a first solution to this problem by designing a curriculum strategy based on the mixup [301] algorithm.
In particular, we use mixup both at the input and feature level to simulate the domain shift and semantic shift the network will encounter at test time. Experiments show how this approach is effective in ZSL, DG, and the two tasks together, producing one of the first attempts at recognizing unseen categories in unseen domains.
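mixup [301], on which the curriculum strategy of Section 4.3 builds, interpolates pairs of inputs and their (one-hot) labels with a coefficient drawn from a Beta(α, α) distribution. The sketch below shows the basic operation only; the curriculum schedule and the feature-level variant described in Section 4.3 are omitted.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """mixup [301]: convex combination of two samples and their labels,
    with lam ~ Beta(alpha, alpha). Applied at the input or feature level,
    it produces intermediate points between domains and semantics."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    x = lam * np.asarray(x1, float) + (1 - lam) * np.asarray(x2, float)
    y = lam * np.asarray(y1, float) + (1 - lam) * np.asarray(y2, float)
    return x, y, lam

x, y, lam = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
assert 0.0 <= lam <= 1.0
assert abs(y.sum() - 1.0) < 1e-9  # mixed label is still a distribution
```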

4.1 Problem statement


Overview. As highlighted in Chapter 1, most existing deep visual models are based on the assumptions that (a) training and test data come from the same underlying distribution, i.e. there is no domain shift, and (b) the set of classes seen during training constitutes the only classes that will be seen at test time, i.e. there is no semantic shift. These assumptions rarely hold in practice: in the real world, training and test images may not only depict different semantic categories, but also differ significantly in terms of visual appearance.

Up to now, we have presented approaches that tackle these problems in isolation. In particular, in Chapter 2, we have considered the case where training and test distribution changes, addressing the domain shift problem, starting from the assumption of having target data available (Section 2.4), and removing it in the more complex domain generalization (Section 2.5), continuous (Section 2.6) and predictive domain adaptation (Section 2.7). However, in all the works we assumed the output space to be constant after the initial training stage and shared between training and test times.

On the other hand, in Chapter 3, we considered the case where the semantic space of a model is extended over time, as new training data arrives, but without the presence of the domain shift problem. In fact, while in Multi-Domain Learning (Section 3.3), a single model is asked to tackle different classification tasks in different visual domains, we have full supervision in each of the domains, and no unseen domain is received at test time. Similarly, in Incremental Learning (Section 3.4) and Open World Recognition (Section 3.5), we consider a single data distribution during all training steps and little to no shift at test time.

In this chapter we focus on a different problem, considering the two shifts occurring jointly at test time. In particular, our goal is recognizing new semantic categories in new domains, without any of the categories and domains being present in our initial training set. In terms of the domain shift, we will consider the problem from a DG perspective (i.e. data of the target domain are not available during training, while multiple sources are). For the semantic shift, we will cast the problem as Zero-Shot Learning (ZSL) [278]. In ZSL, the goal is to recognize objects unseen during training given no training data for them, but only external information about the novel classes provided in the form of semantic attributes [130], visual descriptions [2] or word embeddings [179]. We consider this problem because it allows us to decouple semantic and domain shift, without considering other problems (e.g. catastrophic forgetting, see Section 3.2). Moreover, we will start by considering a ZSL scenario (i.e. at test time we want to recognize only unseen classes) and not the generalized ZSL one [278] (where both seen and unseen categories must be recognized), because this allows us to sidestep the inherent bias our model would have towards seen classes, focusing solely on the domain and semantic shifts.

To clarify the setting, let us consider the case depicted in Fig. 4.1. A system trained to recognize elephants and horses from realistic images and cartoons might be able to recognize the same categories in another visual domain, such as art paintings (Fig. 4.1, bottom), or it might be able to describe other quadrupeds in the same training visual domains (Fig. 4.1, top). On the other hand, how to deal with the case where new animals appear in a new visual domain is not clear. We want to remark that, while Fig. 4.1 depicts a toy example, the need for a holistic approach jointly recognizing unseen categories in unseen domains comes from the large variability of the real world itself. Since it is impossible to construct a training set containing such variability, we cannot train a model to be robust to all the possible environments and semantic inputs it might encounter. Addressing these two problems together allows our models to be more robust to these variabilities. Applications where we need such robustness are countless. For example, given a robot manipulation task, we cannot forecast a priori all the possible conditions (e.g. environments, lighting) it will be employed in. Moreover, we might have training data only for a subset of the objects we want to recognize, while having only descriptions for the others.



Figure 4.1. Our ZSL+DG problem. During training we have images of multiple categories (e.g. elephant, horse) and domains (e.g. photo, cartoon). At test time, we want to recognize unseen categories (e.g. dog, giraffe), as in ZSL, in unseen domains (e.g. paintings), as in DG, exploiting side information describing seen and unseen categories.


To our knowledge, our work [162] is the first attempt to answer this question, proposing a method that is able to recognize unseen semantic categories in unseen domains. In particular, our goal is to jointly tackle ZSL and DG (see Fig.4.1). ZSL algorithms usually receive as input a set of images with their associated semantic descriptions, and learn the relationship between an image and its semantic attributes. Likewise, DG approaches are trained on multiple source domains and at test time are asked to classify images, assigning labels within the same set of source categories but in an unseen target domain. Here we want to address the scenario where, during training, we are given a set of images of multiple domains and semantic categories and our goal is to build a model able to recognize images of unseen concepts, as in ZSL, in unseen domains, as in DG.

To achieve this, we need to address challenges usually not present when these two classical tasks, i.e. ZSL and DG, are considered in isolation. For instance, while in DG we can rely on the fact that the multiple source domains permit to disentangle semantic and domain-specific information, in ZSL+DG we have no guarantee that the disentanglement will hold for the unseen semantic categories at test time. Moreover, while in ZSL it is reasonable to assume that the learned mapping between images and semantic attributes will generalize also to test images of the unseen concepts, in ZSL+DG we have no guarantee that this will happen for images of unseen domains.
To overcome these issues, during training we simulate both the semantic and the domain shift we will encounter at test time. Since explicitly generating images of unseen domains and concepts is an ill-posed problem, we sidestep this issue and we synthesize unseen domains and concepts by interpolating existing ones. To do so, we revisit the mixup [301] algorithm as a tool to obtain partially unseen categories and domains. Indeed, by randomly mixing samples of different categories we obtain new samples which do not belong to a single one of the available categories during training. Similarly, by mixing samples of different domains, we obtain new samples which do not belong to a single source domain available during training.

Under this perspective, mixing samples of both different domains and classes allows us to obtain samples that cannot be assigned to a single class and domain among those available during training; thus, they are novel both in their semantics and in their visual representation. Since higher levels of abstraction contain more task-related information, we perform mixup at both image and feature level, showing experimentally the need for this choice. Moreover, we introduce a curriculum-based mixing strategy to generate increasingly complex training samples. We show that our CuMix (Curriculum Mixup for recognizing unseen categories in unseen domains) model obtains state-of-the-art performance in both ZSL and DG on standard benchmarks, and that it can be effectively applied to the combination of the two tasks, recognizing unseen categories in unseen domains. 1

To summarize, the contributions of this chapter are: (i) We introduce the ZSL+DG scenario, a first step towards recognizing unseen categories in unseen domains. (ii) We describe CuMix, the first holistic method able to address ZSL, DG, and the two tasks together. Our method is based on simulating new domains and categories during training by mixing the available training domains and classes both at image and feature level. The mixing strategy becomes increasingly more challenging during training, in a curriculum fashion. (iii) Through extensive evaluations and analyses, we show the effectiveness of CuMix in all three settings, namely ZSL, DG and ZSL+DG.

Problem statement. In this chapter, we consider the ZSL+DG problem. Differently from the incremental learning methods presented in Chapter 3, here we assume that new semantic concepts are not available in the form of a training set, but are described by semantic descriptors received at test time. Using the semantic descriptors of the training classes, we can learn how to match visual features to semantic descriptions, and generalize this mapping to classes unseen during training. For simplicity, in this section we assume to have exact knowledge about the domain label of each sample.


1 The code is available at https://github.com/mancinimassimiliano/CuMix


In the ZSL+DG problem, the goal is to recognize unseen categories (as in ZSL) in unseen domains (as in DG). Formally, let X denote the input space (e.g. the image space), Y the set of possible classes and D the set of possible domains. During training, we are given a set S = {(xi, yi, di)}_{i=1}^{n} where xi ∈ X, yi ∈ Ys and di ∈ Ds. Note that Ys ⊂ Y and Ds ⊂ D and, as in standard DG, we have multiple source domains (i.e. Ds = ⋃_{j=1}^{m} dj, with m > 1) with different distributions, i.e. pX(x∣di) ≠ pX(x∣dj) for all i ≠ j. Given S, our goal is to learn a function h mapping an image x of the unseen domains Du ⊂ D to its corresponding label in a set of classes Yu ⊂ Y. Note that in standard ZSL, while the sets of training and test domains are shared, i.e. Ds ≡ Du, the label sets are disjoint, i.e. Ys ∩ Yu = ∅, thus Yu is a set of unseen classes. On the other hand, in DG we have a shared output space, i.e. Ys ≡ Yu, but a disjoint set of domains between training and test, i.e. Ds ∩ Du = ∅, thus Du is a set of unseen domains. Since the goal of our work is to recognize unseen classes in unseen domains, we unify the settings of DG and ZSL, considering both semantic and domain shift at test time, i.e. Ys ∩ Yu = ∅ and Ds ∩ Du = ∅.
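As a concrete illustration, the toy setting of Fig. 4.1 can be written down directly; the class and domain names below are the hypothetical ones of the figure, and the disjointness checks encode the semantic and domain shift that distinguish ZSL+DG from its two parent settings:

```python
# Hypothetical toy instantiation of the ZSL+DG setting, mirroring Fig. 4.1.
Y_s = {"elephant", "horse"}    # seen classes Ys (training)
Y_u = {"dog", "giraffe"}       # unseen classes Yu (test)
D_s = {"photo", "cartoon"}     # source domains Ds (training, m > 1)
D_u = {"painting"}             # unseen target domain Du (test)

# ZSL+DG: both label sets and domain sets are disjoint between train and test.
assert Y_s.isdisjoint(Y_u) and D_s.isdisjoint(D_u)
# Standard DG would instead share the label space (Ys = Yu);
# standard ZSL would share the domains (Ds = Du).
```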

4.2 Related Works


In this section, we review related works in ZSL, as well as works performing DA and/or DG with techniques linked to the mixup algorithm, which serves as the basis of our method. We also describe works addressing ZSL under domain shift and/or DG with different semantic spaces, highlighting the differences with our setting.

Zero-Shot Learning (ZSL). Traditional ZSL approaches attempt to learn a projection function mapping images/visual features to a semantic embedding space where classification is performed. This idea is realized by directly predicting image attributes, e.g. [130], or by learning a linear mapping through margin-based objective functions [1, 2]. Other approaches explored the use of non-linear multimodal embeddings [276], intermediate projection spaces [303, 304] or similarity-based interpolation of base classifiers [34]. Recently, various methods tackled ZSL from a generative point of view, considering Generative Adversarial Networks [279], Variational Autoencoders (VAEs) [235] or both [281]. While none of these approaches explicitly tackled the domain shift, i.e. visual appearance changes among different domains/datasets, various methods proposed to use domain adaptation techniques, e.g. to refine the semantic embedding space, aligning semantic and projected visual features [235] or, in transductive scenarios, to cope with the inherent domain shift existing among the appearance of attributes in different categories [119, 75, 76]. For instance, in [235] a distance between visual and semantic embeddings projected in the VAE latent space is minimized. In [119] the problem is addressed through a regularised sparse coding framework, while in [75] a multi-view hypergraph label propagation framework is introduced.

Recently, works have also considered coupling ZSL and DA in a transductive setting. For instance, in [312] a semantic guided discrepancy measure is employed to cope with the asymmetric label space between source and target domains. In the context of image retrieval, multiple works addressed the sketch-based image retrieval problem [294, 61], even across multiple domains. In [257] the authors proposed a method to perform cross-domain image retrieval by training domain-specific experts. While these approaches integrated DA and ZSL, none of them considered the more complex scenario of DG, where no target data are available.

Simulating the Domain Shift for Domain Generalization. As highlighted in Section 2.2, multiple research efforts have recently been devoted to addressing the domain generalization problem. Here we recall some of them that are linked to the idea behind the approach we present in the next section. For a more detailed overview of DG works, we refer the reader to Section 2.2.

In particular, since we mix samples to simulate new domains, our approach is linked with data and feature augmentation strategies for DG [238, 268, 267]. Among them, we can distinguish two main categories: adversarial-based [238, 268, 310, 311], trying to simulate novel domains through adversarial perturbations of the original input, and data augmentation-based [267], which determines which augmentations to perform in order to improve the generalization capabilities of the model. Differently from these methods, we will specifically employ mixup to perturb input and feature representations.
Similarly, the fact that mixed samples are made increasingly difficult during training links our approach to episodic strategies for domain generalization, such as [135]. In [135], the authors describe a DG procedure based on multiple domain-specific networks and one domain-agnostic network. During training, a domain-specific feature extractor receives as input images of different domains (i.e. with different distributions) that the domain-agnostic predictor is asked to correctly classify. Vice versa, the domain-agnostic feature extractor must learn to extract features that even a domain-specific classifier of a different domain (with respect to the one of the input image) can correctly classify. In this way, the domain-agnostic components learn to cope with domain shift in their inputs, similarly to what they will experience at test time. Our method does not require domain-specific components; instead, we simulate the domain shift by gradually increasing the challenge posed by the mixed samples.

Recently, works have considered mixup in the context of domain adaptation [285], e.g. to reinforce the judgments of a domain discriminator. However, we employ mixup from a different perspective, i.e. to simulate the semantic and domain shift we will encounter at test time. To the best of our knowledge, no previous method has used mixup for DG or ZSL.

Finally, works have recently considered the heterogeneous domain generalization (HDG) problem [135, 143]. The goal of HDG is to train a feature extractor able to produce useful representations for novel domains and novel categories [143]. The novel domains have their own specific output spaces (as in MDL, see Section 3.3). Although data of novel domains and classes are not present during the feature extractor training phase, data of the novel domains are required to train a classifier for the new domains/categories on top of the agnostic feature extractor. Our ZSL+DG setting is different, since we assume that the model is trained once and uses side information (e.g. word embeddings) to classify unseen categories in unseen domains at test time, without any training samples of the new domains and categories.

4.3 Recognizing Unseen Categories in Unseen Domains 2


4.3.1 Preliminaries


From the definitions of Section 4.1, we recall that our goal is to learn a function h mapping an image x of unseen domains DuD to its corresponding label in a set of unseen classes YuY .

In the following, we divide the function h into three parts: f, mapping images into a feature space Z, i.e. f : X → Z; g, going from Z to a semantic embedding space E, i.e. g : Z → E; and an embedding function ω : Yt → E, where Yt ≡ Ys during training and Yt ≡ Yu at test time. Note that ω is a learned classifier in DG, while in ZSL it is a fixed semantic embedding function, mapping classes into their vectorized representations extracted from external sources. Given an image x, the final class prediction is obtained as follows:

(4.1)  y⋆ = arg max_{y∈Yt} ω(y)⊤ g(f(x)).

In this formulation, f can be any learnable feature extractor (e.g. a deep neural network), while g can be any ZSL predictor (e.g. a semantic projection layer, as in [277], or a compatibility function between visual features and labels, as in [1, 2]). A first solution to the ZSL+DG problem would be to train a classifier on the aggregation of data from all source domains. In particular, for each sample we could minimize a loss function of the form:

(4.2)  LAGG(xi, yi) = Σ_{y∈Ys} ℓ(ω(y)⊤ g(f(xi)), yi)

with ℓ an arbitrary loss function, e.g. the cross-entropy loss. In the following, we show how we can build on the objective in Eq. (4.2) to effectively recognize unseen categories in unseen domains.
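As a sketch of Eq. (4.1) and Eq. (4.2), assuming (hypothetically) a feature extractor `f`, a semantic projection module `g`, and a matrix `W` stacking the class embeddings ω(y) row-wise, with ℓ instantiated as the cross-entropy over the compatibility scores:

```python
import torch
import torch.nn.functional as F

def predict(x, f, g, W):
    """Eq. (4.1): assign each image the class whose embedding (a row of W)
    has the highest compatibility with the projected features g(f(x))."""
    scores = g(f(x)) @ W.t()  # (batch, num_classes) compatibility scores
    return scores.argmax(dim=1)

def loss_agg(x, y, f, g, W):
    """Eq. (4.2) instantiated with the cross-entropy loss over the scores."""
    scores = g(f(x)) @ W.t()
    return F.cross_entropy(scores, y)
```

With `W` fixed to externally provided semantic embeddings this acts as a ZSL predictor; with `W` learned it reduces to a linear classifier, as in DG.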

4.3.2 Simulating Unseen Domains and Concepts through Mixup


The fundamental problem of ZSL+DG is that, during training, we have access neither to visual data associated with the categories in Yu nor to data of the unseen domains Du. One way to overcome this issue in ZSL is to generate samples of unseen classes by learning a generative function conditioned on the semantic embeddings in W = {ω(y) ∣ y ∈ Ys} [279, 281]. However, since no description is available for the unseen target domain(s) in Du, this strategy is not feasible in ZSL+DG. On the other hand, previous works on DG proposed to synthesize images of unseen domains through adversarial data augmentation strategies [268, 238]. However, these strategies cannot be directly applied to ZSL, since they cannot easily be extended to generate data for the unseen semantic categories Yu.

To circumvent this issue, we introduce a strategy to simulate, during training, novel domains and semantic concepts by interpolating from those available in Ds and Ys. Simulating novel domains and classes allows us to train the network to cope with both semantic and domain shift, the same situation our model will face at test time. Since explicitly generating inputs of novel domains and categories is a complex task, in this section we propose to achieve this goal by mixing images and features of different classes and domains, revisiting the popular mixup [301] strategy.


2 M. Mancini, Z. Akata, E. Ricci, B. Caputo. Towards Recognizing Unseen Categories in Unseen Domains. European Conference on Computer Vision (ECCV), 2020.





Figure 4.2. Our CuMix Framework. Given an image (bottom, horse, photo), we randomly sample one image from the same (middle, photo) and one from another (top, cartoon) domain. The samples are mixed through ϕ (white blocks) both at image and feature level,with their features and labels projected into the embedding space E (by g and ω respectively) and there compared to compute the final objective. Note that ϕ varies during training (top part), changing the mixing ratios in and across domains.


In practice,given two elements ai and aj of the same space (e.g. ai,ajX ), mixup [301] defines a mixing function φ as follows:

(4.3)φ(ai,aj)=λai+(1λ)aj

with λ sampled from a beta distribution, i.e. λ ∼ Beta(β, β), with β a hyperparameter. Given two samples (xi, yi) and (xj, yj) randomly drawn from a training set T, a new loss term is defined as:

(4.4)LMIXUP ((xi,yi),(xj,yj))=LAGG(φ(xi,xj),φ(y¯i,y¯j))

where ȳi ∈ {0,1}^{|Ys|} is the one-hot vectorized representation of the label yi. Note that, when mixing two samples and their label vectors with φ, a single λ is drawn and applied within φ in both the image and label spaces. The loss defined in Eq. (4.4) forces the network to disentangle the various semantic components (i.e. yi and yj) contained in the mixed inputs (i.e. xi and xj), plus the ratio λ used to mix them. This auxiliary task acts as a strong regularizer that helps the network, e.g., to be more robust against adversarial examples [301]. Note, however, that the function φ creates inputs and targets which do not represent a single semantic concept in T, but contain characteristics taken from multiple samples and categories, synthesizing a new semantic concept from the interpolation of existing ones.
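The mixing operation of Eq. (4.3) and Eq. (4.4) can be sketched as follows; a minimal NumPy version with illustrative names, where a single λ is shared between the input and label spaces:

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, beta=1.0, rng=None):
    """Eq. (4.3)-(4.4): mix two inputs and their one-hot label vectors with a
    single lambda ~ Beta(beta, beta), shared by image and label spaces."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(beta, beta)
    x_mix = lam * x_i + (1 - lam) * x_j  # mixed input  phi(x_i, x_j)
    y_mix = lam * y_i + (1 - lam) * y_j  # mixed target phi(y_i, y_j)
    return x_mix, y_mix, lam
```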

For recognizing unseen concepts in unseen domains at test time, we revisit φ to obtain both cross-domain and cross-semantic mixes during training, simulating both semantic and domain shift. While simulating the semantic shift is a by-product of the original mixup formulation, here we explicitly revisit φ in order to also perform cross-domain mixups. In particular, instead of considering a pair of samples from our training set, we consider a triplet (xi, yi, di), (xj, yj, dj) and (xk, yk, dk). Given (xi, yi, di), the other two elements of the triplet are randomly sampled from S, with the only constraints that di = dk, i ≠ k, and dj ≠ di. In this way, the triplet contains two samples of the same domain (i.e. di) and a third of a different one (i.e. dj). Our mixing function ϕ is then defined as follows:
(4.5)ϕ(ai,aj,ak)=λai+(1λ)(γaj+(1γ)ak)

with γ sampled from a Bernoulli distribution, i.e. γ ∼ B(α), and a representing either the input x or the vectorized version of the label y, i.e. ȳ. Note that we introduced a term γ which allows us to perform either intra-domain (with γ = 0) or cross-domain (with γ = 1) mixes.
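A minimal sketch of the triplet mixing function ϕ of Eq. (4.5); the names are illustrative, and the λ = 1 fallback for β = 0 is a simplification for the corner case where Beta(β, β) is undefined:

```python
import numpy as np

def sample_ratios(alpha, beta, rng):
    """Draw the mixing ratio lambda ~ Beta(beta, beta) and the cross-domain
    switch gamma ~ Bernoulli(alpha); lambda = 1 when beta = 0 (plain sample)."""
    lam = rng.beta(beta, beta) if beta > 0 else 1.0
    gamma = float(rng.random() < alpha)
    return lam, gamma

def phi(a_i, a_j, a_k, lam, gamma):
    """Eq. (4.5): a_j comes from a different domain than a_i, while a_k comes
    from the same domain; gamma selects cross-domain (1) vs intra-domain (0)."""
    return lam * a_i + (1 - lam) * (gamma * a_j + (1 - gamma) * a_k)
```

The same (λ, γ) draw is applied once to the inputs (xi, xj, xk) and once to the corresponding one-hot label vectors.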

To learn a feature extractor f and a semantic projection layer g robust to domain and semantic shift, we propose to use ϕ to simulate both samples and features of novel domains and classes during training. Namely, we simulate the semantic and domain shift at two levels, i.e. the image and the feature level. Given a sample (xi, yi, di) ∈ S, we define the following loss:

(4.6)LM-IMG (xi,yi,di)=LAGG(ϕ(xi,xj,xk),ϕ(y¯i,y¯j,y¯k)).

where (xj, yj, dj) and (xk, yk, dk) are randomly sampled from S, with di = dk and dj ≠ di. The loss term in Eq. (4.6) forces the feature extractor to effectively process inputs of mixed domains/semantics obtained through ϕ. Inspired by [264], we design an additional loss acting at the classification level, enforcing the semantic consistency of mixed features in E. This loss term is defined as:

(4.7)  LM-F(xi, yi, di) = Σ_{y∈Ys} ℓ(ω(y)⊤ g(ϕ(f(xi), f(xj), f(xk))), ϕ(ȳi, ȳj, ȳk))

where, as before, (xj, yj, dj), (xk, yk, dk) ∈ S, with di = dk, i ≠ k and dj ≠ di, and ℓ is a generic loss function, e.g. the cross-entropy loss. This second loss term forces the classifier ω and the semantic projection layer g to be robust to features with mixed domains and semantics.

While we could simply use a fixed mixing function ϕ, as defined in Eq. (4.5), for Eq. (4.6) and Eq. (4.7), we found it more beneficial to devise a dynamic ϕ which changes its behaviour during training, in a curriculum fashion. Intuitively, minimizing the two objectives defined in Eq. (4.6) and Eq. (4.7) requires our model to correctly disentangle the various semantic components used to form the mixed samples. While this is a complex task even for intra-domain mixes (i.e. when only the semantics are mixed), mixing samples across domains makes the task even harder, requiring the model to also isolate domain-specific factors. To effectively tackle this task, we choose to act on the mixing function ϕ. In particular, we want ϕ to create mixed samples with a progressively increasing degree of mixing with respect to both content and domain, in a curriculum-based fashion.

During training we regulate both α (governing the probability of cross-domain mixes) and β (governing the distribution of the mixing ratio λ), thus changing the probability distributions of the mixing ratio λ and of the cross-domain factor γ. In particular, given a warm-up step of N epochs and denoting by s the current epoch, we set β = min(s/N · βmax, βmax), with βmax a hyperparameter, while α = max(0, min((s − N)/N, 1)). As a consequence, the learning process is made of three phases, with smooth transitions among them. We start by solving the plain classification task on single domains (i.e. s < N, α = 0, β = s/N · βmax). In the subsequent phase (N ≤ s < 2N), samples of the same domain are mixed randomly, with possibly different semantics (i.e. α = (s − N)/N, β = βmax). In the third phase (s ≥ 2N), we mix samples of different domains (i.e. α = 1), simulating the domain shift the predictor will face at test time. Figure 4.2 shows a representation of how ϕ varies during training (top, white blocks).
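The schedule above can be summarized in a small helper function; a minimal sketch, with `s`, `N` and `beta_max` following the notation of the text:

```python
def curriculum(s, N, beta_max):
    """Curriculum values of alpha and beta at epoch s (warm-up length N).
    Phase 1 (s < N):       alpha = 0, beta ramps to beta_max (plain training).
    Phase 2 (N <= s < 2N): alpha ramps from 0 to 1 (intra-domain mixes).
    Phase 3 (s >= 2N):     alpha = 1 (cross-domain mixes)."""
    beta = min(s / N * beta_max, beta_max)
    alpha = max(0.0, min((s - N) / N, 1.0))
    return alpha, beta
```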
Final objective. The full training procedure is represented in Figure 4.2. Given a training sample (xi, yi, di), we randomly draw two other samples, (xj, yj, dj) and (xk, yk, dk), with di = dk, i ≠ k and dj ≠ di, feed them to ϕ and obtain the first mixed input. We then feed xi, xj, xk and the mixed sample through f to extract their respective features. At this point we use the features extracted from two other randomly drawn samples (in the figure, for simplicity, xj and xk with the same mixing ratios λ and γ) to obtain the feature-level mixed features needed to build the objective in Eq. (4.7). The features of xi and the two mixed variants, at image and feature level, are then fed to the semantic projection layer g, which maps them to the embedding space E. At the same time, the labels in Ys are projected into E through ω. Finally, the objectives defined in Eq. (4.2), Eq. (4.6) and Eq. (4.7) are computed in the semantic embedding space. Our final objective is:

$$\mathcal{L}_{\text{CuMix}}(\mathcal{S}) = \frac{1}{|\mathcal{S}|}\sum_{(x_i, y_i, d_i)\in \mathcal{S}}\Big[\mathcal{L}_{\text{AGG}}(x_i, y_i) + \eta_I\,\mathcal{L}_{\text{M-IMG}}(x_i, y_i, d_i) + \eta_F\,\mathcal{L}_{\text{M-F}}(x_i, y_i, d_i)\Big] \qquad (4.8)$$

with ηI and ηF hyperparameters weighting the importance of the two terms. As the loss on (x, y) in LAGG, LM-IMG and LM-F, we use the standard cross-entropy, even though any ZSL objective could be applied. We highlight that the optimization is performed batch-wise, thus the sampling of the triplet also considers the current batch and not the full training set S. Moreover, while in Figure 4.2 we show for simplicity that the same samples are drawn for LM-IMG and LM-F, in practice, given a sample, the random sampling of the other two members of the triplet is carried out twice, once at the image level and once at the feature level. Similarly, the sampling of the mixing ratio λ and of the cross-domain factor γ of ϕ is carried out sample-wise and twice, once at the image level and once at the feature level. As in Eq. (4.3), λ and γ are kept fixed across mixed inputs/features and their respective targets in the label space.
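As a concrete illustration, the three terms of Eq. (4.8) can be assembled as below. This is a NumPy sketch, not the authors' implementation: the scores stand in for the compatibilities computed in the embedding space E, the mixed target is assumed to be the λ-weighted combination of the labels as in Eq. (4.3), and all function names are hypothetical:

```python
import numpy as np

def cross_entropy(scores, targets):
    """Cross-entropy against a (possibly soft, mixed) target distribution."""
    logp = scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
    return -(targets * logp).sum(axis=-1).mean()

def cumix_loss(scores, scores_mix_img, scores_mix_feat, y, y_mix, eta_i, eta_f):
    """Sketch of Eq. (4.8): the aggregated loss on the clean sample plus the
    image-level and feature-level mixed terms, weighted by eta_I and eta_F.
    y_mix is the lambda-weighted mixed label of Eq. (4.3)."""
    return (cross_entropy(scores, y)
            + eta_i * cross_entropy(scores_mix_img, y_mix)
            + eta_f * cross_entropy(scores_mix_feat, y_mix))
```

Setting ηI = ηF = 0 recovers the plain classification objective LAGG, which corresponds to the first phase of the curriculum.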

Discussion. We now discuss the similarities between our CuMix framework and existing DG and ZSL methods. In particular, presenting the classifier with noisy features extracted by a non-domain-specialist network has a goal similar to the episodic strategy for DG described in [135]. On the other hand, here we sidestep the need to train domain experts by directly feeding our classifier features of novel domains, obtained by interpolating the available source samples. Our method is also linked to mixup approaches developed in DA [285]. Differently from them, we use mixup to simulate unseen domains rather than to progressively align the source to the given target data.



Figure 4.3. ZSL results on CUB, SUN, AWA and FLO datasets with ResNet-101 features.


Our method is also related to ZSL frameworks based on feature generation [279, 281]. While the quality of our synthesized samples is lower, since we do not exploit attributes for conditional generation, we have a lower computational cost. In fact, during training we simulate the test-time semantic shift without generating samples of unseen classes. Moreover, we require neither additional training phases on the generated samples nor the availability of unseen class attributes beforehand.

4.3.3 Experimental results


Datasets and implementation details


We assess CuMix in three scenarios: ZSL, DG and the proposed ZSL+DG setting.

ZSL. We conduct experiments on four standard benchmarks: Caltech-UCSD-Birds 200-2011 (CUB) [271], SUN attribute (SUN) [204], Animals with Attributes (AWA) [130] and Oxford Flowers (FLO) [192]. CUB contains 11,788 images of 200 bird species, with 312 attributes, SUN 14,430 images of 717 scenes annotated with 102 attributes, and AWA 30,475 images of 50 animal categories with 85 attributes. Finally, FLO is a fine-grained dataset of flowers, containing 8,189 images of 102 categories. As semantic representation, we use the visual descriptions of [217], following [279, 277]. For each dataset, we use the train, validation and test splits provided by [278]. In all settings we employ features extracted from the second-last layer of a ResNet-101 [98] pre-trained on ImageNet as image representation. For CuMix, we consider f to be the identity function and g a simple fully connected layer, performing our version of mixup directly at the feature level while applying our alignment loss in the embedding space. All hyperparameters have been set following [278].
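Concretely, with f the identity function the model reduces to a single projection g followed by a compatibility score against the class embeddings in E. The sketch below illustrates this head; the dimensions, weight initialization and function names are illustrative assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2048-d ResNet-101 features, 300-d semantic embedding space E.
feat_dim, emb_dim, n_classes = 2048, 300, 5
W_g = rng.normal(scale=0.01, size=(feat_dim, emb_dim))   # g: one fully connected layer
class_emb = rng.normal(size=(n_classes, emb_dim))        # omega(y): per-class semantic vectors

def compatibility_scores(features):
    """Project visual features into E with g, then score each class by the
    dot product with its L2-normalized semantic embedding."""
    z = features @ W_g
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    return z @ c.T   # shape: (batch, n_classes)

scores = compatibility_scores(rng.normal(size=(4, feat_dim)))
```

At test time, classification over unseen classes amounts to an argmax of these scores after swapping in the unseen-class embeddings.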

DG. We perform experiments on the PACS dataset [133], with 9,991 images of 7 semantic classes in 4 different visual domains: art paintings, cartoons, photos and sketches. For this experiment we use the standard train and test split defined in [133], with the same validation protocol. We use as base architecture a ResNet-18 [98] pre-trained on ImageNet. For our model, we consider f to be the ResNet-18 and g the identity function. We use the same training hyperparameters and protocol of [135].

ZSL+DG. Since no previous work addressed the ZSL+DG problem, there is no benchmark for this task. As a valuable benchmark, we choose DomainNet [206], a recently introduced dataset for multi-source domain adaptation with a large variety of domains, visual concepts and possible descriptions. It contains approximately 600,000 images from 345 categories and 6 domains: clipart, infograph, painting, quickdraw, real and sketch.
To convert this dataset from a DA to a ZSL scenario, we need to define a set of unseen classes. Since CuMix uses a network pre-trained on ImageNet [225], following the good practices in [280] the set of unseen classes cannot contain any class present in ImageNet. We build our validation and test sets from 100 classes that contain at least 40 images per domain and have no overlap with ImageNet. We reserve 45 of these classes for the unseen test set, matching the number used in [257], and the remaining 55 classes for the unseen validation set. The remaining 245 classes are used as seen classes during training.

We set the hyperparameters of each method by training on all the images of the seen classes from a subset of the source domains and validating on all the images of the validation classes from the held-out source domain. Once the hyperparameters are set, we retrain the model on the training classes (245) together with the validation classes (55), for a total of 300 seen classes. Finally, we report the final results on the 45 unseen classes. As semantic representation we use word2vec embeddings [179] extracted from the Google News corpus and L2-normalized, following [257]. For all the baselines and our method, we employ as base architecture a ResNet-50 [98] pre-trained on ImageNet, using the same number of epochs and SGD with momentum as optimizer, with the same hyperparameters of [257].

Results


ZSL. In the ZSL scenario, we choose as baselines standard inductive methods plus more recent approaches. In particular, we report the results of ALE [1], SJE [2], SYNC [34], GFZSL [265] and SPNet [277]. ALE [1] and SJE [2] are linear compatibility methods using a ranking loss and the structural SVM loss respectively. SYNC [34] learns a mapping between the feature space and the semantic embedding space by means of phantom classes and a weighted graph. GFZSL [265] employs a generative framework where each class-conditional distribution is modeled as a multivariate Gaussian. Finally, SPNet [277] learns a semantic projection function from the feature space to the semantic embedding space by minimizing the standard cross-entropy loss.

Our results, grouped by dataset, are reported in Figure 4.3. Our model achieves performance either superior or comparable to the state of the art in all benchmarks but AWA. We believe that in AWA learning a better alignment between visual features and attributes may not be as effective as improving the quality of the visual features. In particular, although the names of the test classes do not appear in the training set of ImageNet, since AWA is not a fine-grained dataset, the information content of the test classes is likely already represented by the ImageNet training classes. Moreover, for non-fine-grained datasets, finding labeled training data may not be as challenging as it is for fine-grained ones. Hence, we argue that zero-shot learning is of higher practical interest in fine-grained settings. Indeed, CuMix is effective in fine-grained scenarios (i.e. CUB, SUN, FLO), where it consistently outperforms the state-of-the-art approaches.

Table 4.1. Domain Generalization accuracies on PACS with ResNet-18.

Target     AGG    DANN [78]   MLDG [134]   CrossGrad [238]   MetaReg [10]   JiGen [27]   Epi-FCR [135]   CuMix
Photo      94.9   94.0        94.3         94.0              94.3           96.0         93.9            95.1
Art        76.1   81.3        79.5         78.7              79.5           79.4         82.1            82.3
Cartoon    73.8   73.8        77.3         73.3              75.4           75.3         77.0            76.5
Sketch     69.4   74.3        71.5         65.1              72.2           71.4         73.0            72.6
Average    78.5   80.8        80.7         77.8              80.4           80.5         81.5            81.6


These results show that our mixup-based model achieves competitive performance on ZSL by simulating the semantic shift the classifier will experience at test time. To this end, our approach is the first to show that mixup can be a powerful regularization strategy for the challenging ZSL setting.

DG. The second series of experiments considers the standard DG scenario. Here we test our model on the PACS dataset using a ResNet-18 architecture. As baselines for DG we consider the standard model trained on all source domains together (AGG), the adversarial strategies in [78] (DANN) and [238] (CrossGrad), the meta learning-based strategies MLDG [134] and MetaReg [10], and the episodic strategy presented in [135] (Epi-FCR).

As shown in Table 4.1, our model achieves competitive results, comparable to the state-of-the-art episodic strategy Epi-FCR [135]. Remarkable is the gain obtained with respect to the adversarial augmentation strategy CrossGrad [238]. Indeed, synthesizing novel domains for domain generalization is an ill-posed problem, since the concept of unseen domain is hard to capture. However, with CuMix we are able to simulate inputs/features of novel domains by simply interpolating the information available in the source samples. Despite the mixed samples containing only information already available in the original sources, our approach produces a model more robust to domain shift.

Another interesting comparison is against the self-supervised approach JiGen [27]. Similarly to [27], we employ an additional task to achieve higher generalization to unseen domains. While in [27] JigSaw puzzles [194] are used as a secondary self-supervised task, here we employ the mixed samples/features in the same manner. The improvement in performance of CuMix highlights that recognizing the semantics of mixed samples acts as a more powerful secondary task for improving robustness to unseen domains.

Finally, it is worth noting that CuMix performs a form of episodic training, similar to Epi-FCR [135]. However, while Epi-FCR considers multiple domain-specific architectures (to simulate the domain experts needed to build the episodes), we require a single domain-agnostic architecture. We build our episodes by making the mixup among images/features of different domains increasingly drastic. Despite not requiring any domain experts, CuMix achieves performance comparable to Epi-FCR, showing the efficacy of our strategy for simulating unseen domain shifts.

Ablation study. In this section, we ablate the various components of CuMix. We perform the ablation on the PACS benchmark for DG, since this allows us to show how different choices affect generalization to unseen domains. In particular, we ablate the following implementation choices: (i) mixing samples at the image level, the feature level, or both; (ii) the impact of our curriculum-based strategy for mixing samples and features.

Table 4.2. Ablation on PACS dataset with ResNet-18 as backbone.

LAGG   LM-IMG   LM-F   Curriculum   Art    Cartoon   Photo   Sketch   Avg.
✓      -        -      -            76.1   73.8      94.9    69.4     78.5
✓      ✓        -      -            78.4   72.7      94.7    59.5     76.3
✓      -        ✓      -            81.8   76.5      94.9    71.2     81.1
✓      ✓        ✓      -            82.7   75.4      95.4    71.5     81.2
✓      ✓        ✓      ✓            82.3   76.5      95.1    72.6     81.6


As shown in Table 4.2, mixing samples at the feature level produces a clear gain over the baseline, while mixing samples only at the image level can even harm performance. This happens particularly in the sketch domain, where mixing samples at the feature level produces a gain of 2% while at the image level we observe a drop of 10% with respect to the baseline. A possible explanation is that mixing samples at the image level produces inputs that are too noisy for the network and not representative of the actual shift experienced at test time. Mixing samples at the feature level instead, after multiple layers of abstraction, better fuses the information contained in the different samples, leading to more reliable features for the classifier. Using both, we obtain higher results in almost all domains.
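The qualitative difference between the two mixing levels can be seen even in a toy example: because the network f is nonlinear, mixing pixels before f and mixing features after f are genuinely different augmentations. A minimal sketch, with a stand-in nonlinearity in place of a real feature extractor (all names here are illustrative):

```python
import numpy as np

def mix(a, b, lam):
    """Convex combination used by CuMix at both mixing levels (a sketch)."""
    return lam * a + (1.0 - lam) * b

def f(x):
    """Hypothetical stand-in for the feature extractor: any nonlinearity will do."""
    return np.tanh(x).mean(axis=(1, 2))

rng = np.random.default_rng(0)
x_i, x_j = rng.random((3, 32, 32)), rng.random((3, 32, 32))  # two toy "images"
lam = 0.7

x_mixed = mix(x_i, x_j, lam)          # image-level mix: pixels blended before f
h_mixed = mix(f(x_i), f(x_j), lam)    # feature-level mix: blended after f's abstractions

# f(x_mixed) != h_mixed in general, since f is nonlinear: the two levels
# expose the classifier to different kinds of mixed inputs.
```

This is why the ablation treats the two levels as distinct, complementary augmentations rather than redundant ones.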

Finally, we analyze the impact of the curriculum-based strategy for mixing samples and features. As the table shows, adding the curriculum strategy boosts performance in the most difficult cases (i.e. sketches), producing a further accuracy gain. Moreover, applying this strategy stabilizes the training procedure, as demonstrated experimentally.

ZSL+DG. On the proposed ZSL+DG setting we use the DomainNet dataset, training on five out of six domains and reporting the average per-class accuracy on the held-out one. We report results for all possible target domains but one, real photos, since our backbone has been pre-trained on ImageNet and the photo domain is thus not truly unseen. Since no previous method addressed the ZSL+DG problem, in this section we consider simple baselines derived from the literature of both ZSL and DG. The first baseline is a standard ZSL model without any DG algorithm (i.e. the standard AGG): as ZSL method we consider SPNet [277]. The second baseline is a DG approach coupled with a ZSL algorithm: we select the state-of-the-art Epi-FCR as the DG approach, coupling it with SPNet. As reference, we also evaluate the performance of standard mixup coupled with SPNet.

As shown in Table 4.3, CuMix achieves competitive performance in the ZSL+DG setting when compared to a state-of-the-art approach for DG (Epi-FCR) coupled with a state-of-the-art one for ZSL (SPNet), outperforming this baseline in all settings but sketch and, on average, by almost 1%. Particularly interesting are the results on the infograph and quickdraw domains. These two domains are the ones where the shift is most evident, as highlighted by the lower results of the baseline. In these scenarios, our model consistently outperforms the competitors, with a remarkable gain of more than 1.5% in average per-class accuracy with respect to the ZSL-only baseline. We also highlight that DomainNet is a challenging dataset, where almost all standard DA approaches are ineffective or can even lead to negative transfer [206]. CuMix, however, is able to overcome the unseen domain shift at test time, improving on the baselines in all scenarios. Our model consistently outperforms SPNet coupled with the standard mixup strategy in every scenario. This demonstrates the efficacy of the choices made in CuMix for revisiting mixup in order to recognize unseen categories in unseen domains.

Table 4.3. ZSL+DG scenario on the DomainNet dataset with ResNet-50 as backbone.

Method           Clipart   Infograph   Painting   Quickdraw   Sketch   Avg.
SPNet            26.0      16.9        23.8       8.2         21.8     19.4
mixup+SPNet      27.2      16.9        24.7       8.5         21.3     19.7
Epi-FCR+SPNet    26.4      16.7        24.6       9.2         23.2     20.0
CuMix            27.6      17.8        25.5       9.9         22.6     20.7


4.3.4 Conclusions


In this section, we proposed the novel ZSL+DG scenario. In this setting, during training we are given a set of images of multiple domains and semantic categories, and our goal is to build a model able to recognize unseen concepts, as in ZSL, in unseen domains, as in DG. To solve this problem we design CuMix, the first algorithm which can be holistically and effectively applied to DG, ZSL, and ZSL+DG. CuMix is based on simulating inputs and features of new domains and categories during training by mixing the available source domains and classes, both at image and feature level. Experiments on public benchmarks show the effectiveness of CuMix, which achieves state-of-the-art performance in almost all settings in all tasks. Future work might investigate alternative data-augmentation schemes in the ZSL+DG setting, as well as novel formulations of the mixing functions. Moreover, it would be interesting to extend CuMix to the more realistic Generalized-ZSL scenario, where the model must recognize both seen and unseen categories.

Chapter 5 Conclusions and Future Works


5.1 Summary of contributions


In this thesis, we analyzed the capability of deep neural networks to generalize to unseen input distributions and to include knowledge not present in their initial training set, with the final goal of building deep models able to recognize new/unseen categories in unseen visual domains.

In Chapter 2, we started by analyzing the problem from the perspective of the input the network receives, considering scenarios where the training (source) and test (target) output spaces do not change but their input distributions do. In particular, in Section 2.4 we considered the problem of latent domain discovery in domain adaptation. In this setting, we assume the availability of unlabeled target data during training and that either the source or target domains (or both) are a mixture of multiple latent domains. In this context, we proposed the first deep neural network able to work in this scenario. Our architecture is made of two main components, namely novel multi-domain alignment layers (mDA) and a domain prediction branch. The mDA layers perform batch-normalization (BN) [109], extending previous works on domain adaptation [28, 29, 142] through weighted statistics, computed using the domain probabilities extracted by the domain prediction branch. The domain prediction branch relies on the assumption that similar inputs should produce similar activations, and it is trained through a simple entropy loss, without requiring any domain label. Our results show that our framework successfully enables the deep model to discover latent domains and outperform standard single-source methods.

As a second step, we removed the assumption of having target data available during training by considering the domain generalization (DG) scenario (Section 2.5). Here we build on the idea that improvements in the performance of a DG model can be achieved by modeling the similarity of a target sample to the available source domains. We thus develop a simple extension of the latent domain discovery framework which makes use of domain labels (if available) and of the domain prediction branch at test time to decide which of the source domains should contribute more to the final decision. In particular, we use domain-specific BN layers, weighting their activations by the similarity of the target sample to the source domains. Experiments on robotics scenarios show the effectiveness of the approach in multiple place categorization benchmarks under various domain shifts (e.g. light conditions, seasons, environments), with and without the presence of domain labels (Section 2.5.2). Subsequently, we extend this solution to levels of the network other than BN layers. In particular, we consider merging the activations of domain-specific classifiers at test time, with their importance again weighted by the similarity of the target sample to the source domains. We also explore different kinds of merging strategies, balancing domain-specific features with domain-agnostic ones. Results show the effectiveness of the approach against domain generalization baselines in standard benchmarks (Section 2.5.4).
While DG is one of the domain adaptation scenarios where target data are not present, other settings can be considered, depending on the amount of information we have about our target domain. Studying other aspects of the problem, in Section 2.6 we focused on the continuous domain adaptation scenario, where a single source domain is available during training (without any target data) and adaptation must be performed by exploiting the incoming stream of target samples at test time, without access to the original training set. In this context, we develop an extension of the domain alignment layers [28, 29, 142] to tackle this problem. In particular, we show how updating the statistics of BN layers using the incoming stream of target data is a simple yet effective strategy. We assess the performance of our model, ONDA, on a robotic object classification task, collecting and releasing a dataset for studying this unexplored problem, containing multiple objects in various acquisition conditions.
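The core of this strategy can be sketched as an exponential moving average over the BN statistics, updated from each incoming target batch. This is a minimal illustration, not ONDA's exact update rule: the decay value, class name, and the use of a single normalization layer are our assumptions:

```python
import numpy as np

class OnlineBN:
    """Sketch of test-time BN adaptation: statistics are updated from the
    incoming target stream, with no access to the source training set."""
    def __init__(self, dim, decay=0.1, eps=1e-5):
        self.mean = np.zeros(dim)   # running mean (starts from source-like values)
        self.var = np.ones(dim)     # running variance
        self.decay = decay          # hypothetical update rate
        self.eps = eps

    def adapt_and_normalize(self, batch):
        # Blend current statistics with those of the new target batch.
        self.mean = (1 - self.decay) * self.mean + self.decay * batch.mean(axis=0)
        self.var = (1 - self.decay) * self.var + self.decay * batch.var(axis=0)
        return (batch - self.mean) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
bn = OnlineBN(dim=8)
for _ in range(200):   # a stream of target batches with a shifted distribution
    bn.adapt_and_normalize(rng.normal(loc=3.0, scale=2.0, size=(64, 8)))
```

After enough target batches, the running statistics drift toward those of the target domain, so the normalized activations are re-centered without any gradient update.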

Finally, we considered the predictive domain adaptation (PDA) scenario where, during training, we have a single labeled source domain and multiple unlabeled auxiliary domains, each with an attached description (i.e. metadata). The goal of this problem is to build a model able to address the classification task in the target domain by using just the target-specific metadata. We develop the first deep learning model for this problem, AdaGraph (Section 2.7), which builds a graph where each node is a domain with its domain-specific parameters attached, and each edge encodes the distance between the metadata of the two connected domains. At test time, given the target domain metadata, we obtain the target-specific parameters through a weighted combination of its closest nodes in the graph. Due to their simplicity and the ease of linearly combining them, we use domain-specific BN layers as domain-specific parameters. To improve the estimated target statistics, we also incorporate a continuous domain adaptation strategy into the framework, extending the previously described ONDA algorithm. Experiments show that our model outperforms standard PDA approaches, with the continuous update strategy surpassing state-of-the-art approaches in continuous domain adaptation.
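The test-time parameter retrieval of this approach can be sketched as a kernel-weighted combination over graph nodes. The metadata, Gaussian kernel, bandwidth, and dimensions below are illustrative assumptions, not AdaGraph's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical graph: 4 auxiliary domains, each with 1-d metadata (e.g. a year)
# and per-domain BN statistics (here, 16-d running means).
node_meta = np.array([[0.0], [1.0], [2.0], [3.0]])
node_means = rng.normal(size=(4, 16))

def target_params(target_meta, sigma=1.0):
    """Weight each graph node by a kernel on the metadata distance, then
    take the convex combination of its domain-specific BN parameters."""
    d2 = ((node_meta - target_meta) ** 2).sum(axis=1)   # squared metadata distances
    w = np.exp(-d2 / (2 * sigma ** 2))                  # closer nodes weigh more
    w /= w.sum()
    return w @ node_means

mean_t = target_params(np.array([1.5]))   # target metadata between nodes 1 and 2
```

With a very narrow kernel, the combination collapses onto the single closest node, so a target whose metadata coincides with a known domain simply recovers that domain's parameters.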

In Chapter 3, we moved to the problem of extending the output space of a pre-trained architecture to new semantic categories. We started by analyzing the problem of multi-domain learning, where the goal is to add new (classification) tasks to a pre-trained model without harming the performance on old tasks and with as few task-specific parameters as possible. Our contribution (Section 3.3) has been to show how affinely transformed, task-specific binary masks applied to the original network weights allow a network to learn multiple models with (i) performance close to networks fine-tuned for the specific tasks and (ii) very little overhead in the number of parameters required for each task. We assess the performance of our model on the challenging Visual Domain Decathlon, showing performance comparable to more complicated, multi-stage state-of-the-art approaches.
In Section 3.4, we focused on a different problem, incremental class learning. In this task, we want to add new knowledge to a pre-trained model without having access to the original training set, thus addressing the catastrophic forgetting problem. We analyzed this task in semantic segmentation, discovering that the performance of standard incremental learning algorithms is hampered by the change in the semantics of the background class across different learning steps, a problem which we named the background shift. Indeed, in a given learning step, the background might contain pixels of classes learned in previous steps as well as pixels of classes we will learn in future ones. We showed how a simple modification of the standard cross-entropy and distillation losses, explicitly taking into account the different meaning of the background across learning steps and coupled with an ad-hoc initialization procedure, can effectively address both catastrophic forgetting and background shift, even in large-scale scenarios (e.g. ADE-20k).
在3.4节中,我们聚焦于一个不同的问题,即增量类学习。在这个任务中,我们希望在无法访问原始训练集的情况下,为预训练模型添加新知识,从而解决灾难性遗忘问题。我们在语义分割任务中分析了这个问题,发现标准增量学习算法的性能受到不同学习步骤中背景类语义变化的阻碍,我们将这个问题称为背景偏移。实际上,在给定的学习步骤中,背景可能包含在先前步骤中学习到的类别的像素以及我们将在未来步骤中学习的类别的像素。我们展示了如何对标准的交叉熵和蒸馏损失进行简单修改,明确考虑不同学习步骤中背景的不同含义,并结合特定的初始化过程,即使在大规模场景(例如ADE - 20k)中,也能有效解决灾难性遗忘和背景偏移问题。

In the final section of Chapter 3, we studied a more challenging problem, open-world recognition (OWR). In this task, we must not only be able to add new concepts to a pre-trained model, but also detect unknown concepts when they are received as inputs. In this scenario, we developed DeepNNO (Section 3.5.3), the first end-to-end trainable model for OWR, extending standard non-parametric algorithms [15] with losses and training schemes that prevent catastrophic forgetting. We then showed how clustering-based objectives and trainable class-specific rejection thresholds can further boost the performance of deep OWR models (Section 3.5.4). Experiments on standard datasets and robotics scenarios showed the efficacy of the two approaches and the importance of each design choice. Moreover, we described and tested a simple pipeline for Web-aided OWR, where knowledge about new classes is not given by an external 'oracle' but automatically retrieved from web queries (Section 3.5.6). We believe our algorithms and our web-based pipeline constitute a first meaningful step towards autonomously learning real-world agents.

Finally, in Chapter 4, we merged the two worlds, analyzing whether it is possible to build a model recognizing unseen classes in unseen domains. In particular, we described the problem of Zero-Shot Learning (ZSL) under Domain Generalization (DG), where, during training, we are given images of a set of classes in multiple source domains and, at test time, we are asked to recognize different, unseen categories depicted in unseen visual domains. In this scenario, we must learn how to map images into a semantic embedding space of class descriptions (e.g. word embeddings), making sure that the mapping generalizes both to unseen semantic classes (addressing the semantic shift problem) and to unseen domains (addressing the domain shift problem). We developed the first simple solution to this problem based on mixup [301]. In particular, our idea was to anticipate the shifts we will encounter at test time by simulating samples (and features) of new domains and/or categories, mixing the domains and classes available at training time. Moreover, we made the mixes increasingly more challenging during training by increasing both the probability of a high mixing ratio and that of cross-domain mixing. Our approach, named CuMix, showed remarkable results on ZSL, DG, and the proposed ZSL+DG, being not only the first holistic approach for ZSL and DG but also the first model that effectively recognizes unseen categories in unseen domains.
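A minimal sketch of this mixing strategy, assuming mixup-style convex combinations and a linear curriculum (the schedule and parameter names are illustrative, not the exact CuMix ones):

```python
import numpy as np

def mix_batch(x_a, y_a, x_b, y_b, alpha):
    """mixup-style interpolation of samples and one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x_a + (1 - lam) * x_b
    y = lam * y_a + (1 - lam) * y_b
    return x, y, lam

def curriculum_step(step, total_steps, alpha_max=2.0):
    """Hypothetical curriculum: as training progresses, raise both the
    beta concentration (higher alpha pushes lam towards 0.5, i.e. more
    aggressive mixing) and the probability of mixing samples drawn from
    a *different* source domain."""
    progress = step / total_steps
    alpha = alpha_max * progress + 1e-3   # avoid alpha = 0 at step 0
    p_cross_domain = progress             # chance of cross-domain mixing
    return alpha, p_cross_domain
```

At each iteration one would sample `alpha` and `p_cross_domain` from the curriculum, pick the second sample from the same or a different domain according to `p_cross_domain`, and train on the mixed pair.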

5.2 Open problems and future directions


While in this thesis we studied how to build deep learning models generalizing to either new visual domains (Chapter 2), new semantic concepts (Chapter 3), or both (Chapter 4), multiple problems remain to be addressed and multiple directions to be explored before visual systems can recognize new semantic concepts in arbitrary visual domains.

Starting with the proposed solutions, for each algorithm we briefly discussed possible immediate extensions as well as research directions worth exploring. For instance, in Chapter 2 we discussed multiple solutions involving domain-specific parameters whose activations are merged using the weights obtained from a domain prediction branch. For all these solutions, it might be interesting to investigate how to strengthen the domain classifier. For instance, one could avoid the use of a domain-specific classifier and instead rely on the distances of the activations from the various domain-specific distributions (e.g. the statistics of BN layers) as a measure of domain similarity. On the other hand, stronger clustering objectives could be applied to the domain prediction branch to strengthen the discovery of the latent domains. Moreover, it would be interesting to investigate whether the algorithms of Chapter 2 can be extended to use parameters beyond standard BN layers, as suggested by the preliminary experiments with classifiers in Section 2.5.4. To this end, the affinely transformed binary masks of Section 3.3 could also be a good starting point for obtaining simple and easy-to-combine domain-specific parameters.

When tackling Predictive DA with AdaGraph (Section 2.7), we assign each domain a node in our graph. However, the current formulation has two main drawbacks. First, it is not scalable if the set of possible metadata descriptors grows; second, it considers all the metadata equally important for addressing the domain shift problem. Future work might explore different strategies for including metadata-specific information as well as modeling the importance of the different metadata components. For instance, a possible solution could be to employ domain-specific alignment layers for each metadata group (e.g. viewpoint, year of production) and to learn how to optimally recombine their activations for the final prediction.

Other interesting research directions can be drawn from the works in Chapter 3. For instance, a first question is whether the multi-domain algorithm presented in Section 3.3 can be applied to other scenarios (e.g. incremental class learning) and even to tackle the domain shift problem (e.g. PDA, DG). In the latter case, one could use one binary mask per network parameter per domain, combining them at test time based on the target domain metadata (PDA) or the similarity to the source domains (DG). Another interesting question is whether the model can exploit the relationships among the single tasks/domains through side connections, in such a way that each task/domain benefits from the others.

Concerning incremental learning in semantic segmentation, it would be interesting to quantify the background shift and to understand whether the different kinds of shift (e.g. background containing old classes vs background containing classes we will learn in the future) require more specific solutions than the general one we designed in Section 3.4. On the other hand, it would be interesting to verify whether the effectiveness of our MiB algorithm (or simple extensions of it) generalizes to other problems where the semantics of the background is uncertain, such as incremental learning in object detection [240] and instance segmentation [207], as well as non-incremental tasks such as weakly-supervised semantic segmentation [13], generalized zero-shot learning [277], and dataset merging [58].
In OWR, an interesting future work would be to quantify the robustness of OWR algorithms to the domain shift. In this context, we could verify how much their capabilities of detecting unknowns and recognizing known concepts are affected by changes in the input distributions. Moreover, a very important research direction would be improving the various components of the web-based pipeline sketched in Section 3.5.6. For instance, we could develop a tool for automatic and robust labeling of the detected unknowns. Moreover, we could design algorithms for filtering the noisy web images retrieved and/or for dealing with both noisy labels and domain shift while learning the new categories. We believe the latter is a promising direction towards robotic visual systems that learn fully autonomously from the environment they interact with.

Finally, in Section 4.3, we introduced a new research problem (ZSL+DG) and algorithm (CuMix) with the aim of encouraging the community to develop models tackling domain and semantic shift together. However, we believe that our ZSL+DG problem is just the beginning of this journey. Indeed, in principle, we would like the semantic space of our models to consider both seen and unseen categories (as in Generalized ZSL [278]) in arbitrary domains, while requiring the minimum number of source domains possible (even just one, as in recent works on single-source domain generalization [268, 267, 212]). Moreover, we might receive data for new classes over time, as in incremental learning. In this case, we would like our model to recognize old seen, new seen, and still unseen categories at test time. This would require our model to address semantic shift, with the relative bias among the sets of classes (as in GZSL), the catastrophic forgetting problem [71], and related ones, such as the background shift we identified. Moreover, if the data for the new classes come from new domains, we want our model to also address the domain shift problem. Although with CuMix and ZSL+DG we focused on a subset of these problems, we believe that the contributions of Chapter 4 and the findings of this thesis will push researchers to explore ways of overcoming domain and semantic shift together, towards building visual algorithms able to cope with the large and unpredictable variability of the real world.

Appendix A


Recognition across New Visual Domains


A.1 Latent Domain Discovery


A.1.1 mDA layer formulas


From Section 2.4, the output of our mDA layer is denoted by

\[
y_i = \mathrm{mDA}(x_i, w_i; \hat{\mu}, \hat{\sigma}) = \sum_{d \in D} w_{i,d}\, \hat{x}_{i,d}, \tag{A.1}
\]

where, for simplicity:

\[
\hat{x}_{i,d} = \frac{x_i - \hat{\mu}_d}{\sqrt{\hat{\sigma}_d^2 + \epsilon}}, \tag{A.2}
\]

and the statistics are given by

\[
\hat{\mu}_d = \sum_{i=1}^{b} \hat{w}_{i,d}\, x_i, \qquad
\hat{\sigma}_d^2 = \sum_{i=1}^{b} \hat{w}_{i,d}\, (x_i - \hat{\mu}_d)^2, \tag{A.3}
\]

where $\hat{w}_{i,d} = w_{i,d} / \sum_{j=1}^{b} w_{j,d}$.
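Under these definitions, the forward pass of an mDA layer for a batch of scalar activations can be sketched as follows (a NumPy illustration of Eqs. (A.1)-(A.3), not the actual implementation):

```python
import numpy as np

def mda_forward(x, w, eps=1e-5):
    """Weighted multi-domain BN forward (Eqs. A.1-A.3).
    x: activations of shape (b,); w: soft domain assignments of shape
    (b, D), with each row summing to 1."""
    w_hat = w / w.sum(axis=0, keepdims=True)           # normalize per domain (column)
    mu = w_hat.T @ x                                   # (D,) weighted means
    var = (w_hat * (x[:, None] - mu) ** 2).sum(axis=0) # (D,) weighted variances
    x_hat = (x[:, None] - mu) / np.sqrt(var + eps)     # (b, D) per-domain normalization
    return (w * x_hat).sum(axis=1)                     # (b,) mix per sample
```

With a single domain and uniform assignments, this reduces exactly to standard batch normalization (without the affine scale and bias).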

From the previous equations we can derive the partial derivatives of the loss function with respect to both the input $x_i$ and the domain assignment probabilities $w_{i,d}$. Let us denote by $\frac{\partial L}{\partial y_i}$ the partial derivative of the loss function $L$ with respect to the output $y_i$ of the mDA layer. We have:

\[
\frac{\partial \hat{x}_{i,d}}{\partial \hat{\sigma}_{d'}^2} = -\mathbf{1}_{d=d'}\, \frac{1}{2} (x_i - \hat{\mu}_d)\, (\hat{\sigma}_d^2 + \epsilon)^{-\frac{3}{2}}, \qquad
\frac{\partial \hat{x}_{i,d}}{\partial \hat{\mu}_{d'}} = -\mathbf{1}_{d=d'}\, (\hat{\sigma}_d^2 + \epsilon)^{-\frac{1}{2}}, \tag{A.4}
\]

and

\[
\frac{\partial \hat{\sigma}_d^2}{\partial x_i} = 2\, \hat{w}_{i,d}\, (x_i - \hat{\mu}_d), \qquad
\frac{\partial \hat{\mu}_d}{\partial x_i} = \hat{w}_{i,d}. \tag{A.5}
\]

Thus, the partial derivative of $L$ w.r.t. the input $x_i$ is:

\[
\frac{\partial L}{\partial x_i} = \sum_{d \in D} \frac{w_{i,d}}{\sqrt{\hat{\sigma}_d^2 + \epsilon}} \left[ \frac{\partial L}{\partial y_i} - A_d - \hat{x}_{i,d}\, B_d \right], \tag{A.6}
\]

where:

\[
A_d = \sum_{i=1}^{b} \hat{w}_{i,d}\, \frac{\partial L}{\partial y_i}, \qquad
B_d = \sum_{i=1}^{b} \hat{w}_{i,d}\, \hat{x}_{i,d}\, \frac{\partial L}{\partial y_i}. \tag{A.7}
\]

For the domain assignment probabilities $w_{i,d}$ we have:

\[
\frac{\partial \hat{\mu}_{d'}}{\partial \hat{w}_{i,d}} = \mathbf{1}_{d=d'}\, x_i, \tag{A.8}
\]

\[
\frac{\partial \hat{\sigma}_{d'}^2}{\partial \hat{w}_{i,d}} = \mathbf{1}_{d=d'}\, (x_i - \hat{\mu}_d)^2, \tag{A.9}
\]

\[
\frac{\partial \hat{w}_{i',d'}}{\partial w_{i,d}} = \mathbf{1}_{d=d'}\, \frac{\mathbf{1}_{i=i'} \sum_{j=1}^{b} w_{j,d} - w_{i',d}}{\left( \sum_{j=1}^{b} w_{j,d} \right)^2}. \tag{A.10}
\]

Thus, the partial derivative of $L$ w.r.t. $w_{i,d}$ is:

\[
\frac{\partial L}{\partial w_{i,d}} = \hat{x}_{i,d} \left( \frac{\partial L}{\partial y_i} - A_d \right) - \frac{1}{2} \left( \hat{x}_{i,d}^2 - \frac{\hat{\sigma}_d^2}{\hat{\sigma}_d^2 + \epsilon} \right) B_d, \tag{A.11}
\]

where $A_d$ and $B_d$ are defined as in (A.7).
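As a sanity check of the backward formulas, the gradient of Eq. (A.6), with $A_d$ and $B_d$ as in Eq. (A.7), can be compared against finite differences. This NumPy sketch treats the assignments $w$ as constant with respect to $x$, as Eq. (A.6) does:

```python
import numpy as np

def mda_forward(x, w, eps=1e-5):
    """Forward pass of Eqs. (A.1)-(A.3); also returns the intermediate
    quantities needed by the backward formulas."""
    w_hat = w / w.sum(axis=0, keepdims=True)
    mu = w_hat.T @ x
    var = (w_hat * (x[:, None] - mu) ** 2).sum(axis=0)
    x_hat = (x[:, None] - mu) / np.sqrt(var + eps)
    return (w * x_hat).sum(axis=1), w_hat, var, x_hat

def grad_x(g, x, w, eps=1e-5):
    """Analytic dL/dx_i from Eq. (A.6); g holds dL/dy_i per sample."""
    _, w_hat, var, x_hat = mda_forward(x, w, eps)
    A = (w_hat * g[:, None]).sum(axis=0)            # (D,), Eq. (A.7)
    B = (w_hat * x_hat * g[:, None]).sum(axis=0)    # (D,), Eq. (A.7)
    return (w / np.sqrt(var + eps) * (g[:, None] - A - x_hat * B)).sum(axis=1)

rng = np.random.default_rng(1)
b, D = 6, 3
x = rng.standard_normal(b)
w = rng.random((b, D))
w /= w.sum(axis=1, keepdims=True)   # soft assignments, rows sum to 1
g = rng.standard_normal(b)          # take L = g . y, so dL/dy_i = g_i

analytic = grad_x(g, x, w)
h = 1e-6
numeric = np.empty(b)
for i in range(b):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    numeric[i] = (g @ mda_forward(xp, w)[0] - g @ mda_forward(xm, w)[0]) / (2 * h)
```

With a single domain and uniform assignments, Eq. (A.6) reduces to the familiar batch-normalization gradient, with $A_d$ and $B_d$ playing the role of the batch means of the incoming gradients.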

A.1.2 Training loss progress


In this section, we plot the losses as training progresses for the Digits-five experiments. The plots are shown in Figure A.1. For both MNIST-m and SVHN, the classification loss smoothly decreases, while the domain loss first decreases and then stabilizes around a fixed value. This is a consequence of the balancing term introduced on the domain assignments, which enforces a low entropy for the assignment of a single sample, but a high entropy for the assignments averaged across the entire batch. In Figures A.2 and A.3 we plot the single components of the classification and domain losses, respectively. For the semantic part (Figure A.2), both the entropy loss on target samples and the cross-entropy loss on source samples decrease smoothly. For the domain assignment part (Figure A.3), the entropy loss on single samples rapidly decreases, while the average batch assignment keeps a high entropy, as expected. We highlight that when SVHN is used as target, the source domains are somewhat closer to each other in appearance, thus the average batch entropy has a slightly lower value (i.e. the assignments are less balanced) than in the MNIST-m target case.

Finally, it is worth noticing that the domain loss reaches a stable value earlier than the classification components. This is a design choice, since we want to learn the semantic predictor on stable and confident domain assignments.
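The balancing behaviour described above, confident per-sample assignments combined with balanced batch-level assignments, can be sketched with two entropy terms (an illustrative objective, not the exact training loss):

```python
import numpy as np

def domain_assignment_loss(p, eps=1e-8):
    """Entropy-based objective for soft domain assignments p of shape (b, D):
    minimize the per-sample entropy (confident assignments) while maximizing
    the entropy of the batch-averaged assignment (balanced domains)."""
    per_sample = -(p * np.log(p + eps)).sum(axis=1).mean()  # pushed low
    avg = p.mean(axis=0)
    batch = -(avg * np.log(avg + eps)).sum()                # pushed high
    return per_sample - batch
```

Confident but balanced assignments minimize this objective, which is consistent with the plateau of the domain loss observed in Figure A.1.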



Figure A.1. Digits-five: plots of the domain (orange) and classification (blue) losses during the training phase.



Figure A.2. Digits-five: plots of the cross-entropy loss on source samples (orange) and the entropy loss on target samples (blue) for the semantic classifier during the training phase.


A.1.3 Additional Results on PACS


A crucial problem in domain adaptation, rarely addressed in the literature, is how to tune model hyper-parameters. In fact, setting the hyper-parameter values based on the performance on the source domain is sub-optimal due to the domain shift. Furthermore, assuming the presence of a validation set for the target domain is not realistic in practice [183]: in unsupervised domain adaptation we only assume the presence of a set of unlabelled target data. Despite recent research in this direction [183], there is no clear solution to this problem in the literature. The problem is even more severe in our case, since it is not trivial to define a validation set for the latent domain discovery problem, given the assumption that multiple source and target domains are mixed.

Nonetheless, for the sake of completeness, we analyze the performance of our model and the baselines when a target validation set is assumed to be available for model selection. We consider the PACS dataset, in both the single- and multi-target scenarios. The results are reported in parentheses in Table A.1 and Table A.2. While both our model and the baselines obviously benefit from the validation set, the overall trends remain the same, with our model achieving higher performance than the baseline and close to the multi-source upper bound. Notice that a validation set is especially beneficial in the case of a consistent domain shift: for instance, all the methods improve their results by almost 5% in Table A.1 when Sketch is the target domain.



Figure A.3. Digits-five: plots of the entropy loss on single sample (blue) and on the average batch assignments (orange) for the domain classifier during the training phase.

Table A.1. PACS dataset: comparison of different methods using the ResNet architecture. The first row indicates the target domain, while all the others are considered as sources. The numbers in parentheses indicate the results using a target validation set for model selection.

| Method | Sketch | Photo | Art | Cartoon | Mean |
|---|---|---|---|---|---|
| ResNet [98] | 60.1 | 92.9 | 74.7 | 72.4 | 75.0 |
| DIAL [29] | 66.8 (71.3) | 97.0 (97.4) | 87.3 (87.5) | 85.5 (87.0) | 84.2 (85.8) |
| mDA | 70.7 (75.2) | 97.0 (97.3) | 87.4 (87.7) | 86.3 (87.2) | 85.4 (86.9) |
| Multi-source DA | 71.6 (78.1) | 96.6 (97.2) | 87.5 (88.7) | 87.0 (87.4) | 85.7 (87.9) |


As a final note, we underline that using a validation set on the target domain is not common practice in unsupervised domain adaptation, thus these results can be regarded as an upper bound for our model.

Table A.2. PACS dataset: comparison of different methods using the ResNet architecture on the multi-source multi-target setting. The first row indicates the two target domains. The numbers in parentheses indicate the results using a target validation set for model selection.

| Method | Photo-Art | Photo-Cartoon | Photo-Sketch | Art-Cartoon | Art-Sketch | Cartoon-Sketch | Mean |
|---|---|---|---|---|---|---|---|
| ResNet [98] | 71.4 | 84.2 | 81.4 | 62.2 | 70.3 | 54.2 | 70.6 |
| DIAL [29] | 86.7 (87.5) | 86.5 (87.1) | 86.8 (88.2) | 77.1 (78.7) | 72.1 (74.2) | 67.7 (70.4) | 79.5 (81.0) |
| mDA | 87.2 (87.7) | 88.1 (88.5) | 88.7 (89.7) | 77.7 (79.6) | 81.3 (82.2) | 77.0 (79.3) | 83.3 (84.5) |
| Multi-source/target DA | 87.7 (88.8) | 88.9 (89.8) | 86.8 (88.3) | 79.0 (79.5) | 79.8 (82.2) | 75.6 (79.1) | 83.0 (84.6) |


A.2 Predictive Domain Adaptation


A.2.1 Metadata Details


CompCars. For the experiments with the CompCars dataset [292], we have two pieces of domain information: the car production year and the viewpoint. We encode the metadata through a 2-dimensional integer vector, where the first integer encodes the year of production (between 2009 and 2014) and the second the viewpoint. While encoding the production year is straightforward, for the viewpoint we use the same criterion adopted in [293], i.e. we encode the viewpoint through integers from 1 to 5 in the order: Front, Front-Side, Side, Rear-Side, Rear.

Portraits. For the experiments with the Portraits dataset [82], we again have two pieces of domain information: the year and the region where the picture was taken. To allow for slightly more precise geographical information, we encode the metadata through a 3-dimensional integer vector.

As in the CompCars case, the first integer encodes the decade of the image (8 decades between 1934 and 2014), while the second and third encode the geographical position. For the geographical position we simplify the representation through a coarse encoding involving 2 directions: north-south (from 0 to 1) and east-west (from 0 to 3). In particular, we assign the following value pairs ([north-south, east-west]): Mid-Atlantic [0,1], Midwestern [0,2], New England [0,0], Pacific [0,3] and Southern [1,1]. Each component of the vector is normalized to the range [0,1].
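The encoding above can be sketched as follows (the function name and the decade indexing are illustrative assumptions):

```python
def encode_portrait_metadata(decade_index, region):
    """Encode (decade, region) metadata as a normalized 3-d vector, using
    the coarse [north-south, east-west] scheme described above.
    decade_index: 0..7, for the 8 decades between 1934 and 2014."""
    region_ns_ew = {
        "Mid-Atlantic": (0, 1),
        "Midwestern": (0, 2),
        "New England": (0, 0),
        "Pacific": (0, 3),
        "Southern": (1, 1),
    }
    ns, ew = region_ns_ew[region]
    # normalize each component to [0, 1]
    return (decade_index / 7.0, ns / 1.0, ew / 3.0)
```

Encoding metadata as points in a normalized vector space is what allows the graph of AdaGraph to measure distances between domains.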

A.2.2 Additional Analysis


ResNet-18 on CompCars


Here we apply AdaGraph to the ResNet-18 architecture on the CompCars dataset [292]. As in the other experiments, we apply AdaGraph by replacing each BN layer of the network with its GBN counterpart.

The network is initialized with the weights of the model pre-trained on ImageNet. We train the network for 6 epochs on the source dataset, employing Adam as optimizer with a weight decay of $10^{-6}$ and a batch size of 16. The learning rate is set to $10^{-3}$ for the classifier and $10^{-4}$ for the rest of the network, and it is decayed by a factor of 10 after 4 epochs. We extract the domain-specific parameters by training the network for 1 epoch on the union of source and auxiliary domains, keeping the same optimizer and hyper-parameters. The batch size is kept at 16, building each batch with elements of a single production year-viewpoint pair belonging to one of the domains available during training (either auxiliary or source).

The results are shown in Table A.3. As the table shows, AdaGraph largely improves the performance of the Baseline model. Consistently with the previous experiments, our refinement strategy further increases the performance of AdaGraph, filling almost entirely the gap with the DA upper bound.

Table A.3. CompCars dataset [292]. Results with ResNet-18 architecture.

| Method | Avg. Accuracy |
|---|---|
| Baseline | 56.8 |
| AdaGraph | 65.1 |
| Baseline + Refinement | 65.3 |
| AdaGraph + Refinement | 66.7 |
| DA upper bound | 66.9 |


Performances vs Number of Auxiliary Domains


In this section, we analyze the impact of the number of available auxiliary domains on the performance of our model. We employ the ResNet-18 architecture on the Portraits dataset, with the same setting and hyper-parameters described in the experimental section. However, differently from the previous experiments, we vary the number of available auxiliary domains from 1 to 38. We repeat each experiment 20 times, randomly sampling the available auxiliary domains each time.

The results are shown in Figure A.4. As expected, increasing the number of auxiliary domains leads to an increase in the performance of the model. In general, once more than 20 domains are available, the performance of our model is close to the DA upper bound. While these results obviously depend on the relatedness between the auxiliary domains and the target, the plots show that a large set of auxiliary domains may not be strictly necessary for achieving good performance.



Figure A.4. Portraits dataset: performances of AdaGraph with respect to the number of auxiliary domains available for different source-target pairs. The years reported in the captions indicate the starting year of source and target decades.


Appendix B


Recognizing New Semantic Concepts


B.1 Incremental Learning in Semantic Segmentation


B.1.1 How should we use the background?


As highlighted in Section 3.4, an important design choice for incremental learning in semantic segmentation is how to use the background. In particular, since the background class is present in both old and new classes, it can be considered either in the supervised cross-entropy loss, in the distillation component, or in both. For our MiB method and all the baselines (LwF [144], LwF-MC [216], ILT [178]), we considered the latter case (i.e. background in both). However, a natural question is how different choices for the background would impact the final results. In this section we investigate this point.

We start from the LwF-MC [216] baseline, since it is composed of multiple binary classifiers and thus makes it easy to decouple modifications on the background from the other classes. We then test two variants:

  • LwF-MC-D ignores the background in the classification loss, using as target for the background the probability given by $f_{\theta^{t-1}}$.

  • LwF-MC-C ignores the background in the distillation loss, using only the supervised signal from the ground truth.
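The difference between the variants can be sketched as the construction of the per-pixel binary target for the background classifier (an illustration; the actual LwF-MC blending is governed by a hyper-parameter rather than the equal weights used here):

```python
import numpy as np

def background_targets(gt_is_bg, old_bg_prob, variant):
    """Per-pixel binary target for the background classifier.
    gt_is_bg: boolean ground-truth background mask;
    old_bg_prob: background probability predicted by the old model.
      - 'C':    supervised only (LwF-MC-C) - use the ground truth;
      - 'D':    distillation only (LwF-MC-D) - use the old model;
      - 'both': blend of the two (equal weights in this sketch)."""
    if variant == "C":
        return gt_is_bg.astype(float)
    if variant == "D":
        return old_bg_prob
    return 0.5 * gt_is_bg.astype(float) + 0.5 * old_bg_prob
```

The 'C' target discards what the old model knew about the background (risking forgetting), the 'D' target discards the new supervision (risking intransigence), while the blended target trades the two off.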

In Tables B.1 and B.2 we report the results of the two variants for the overlapped scenarios of the Pascal VOC dataset and the 50-50 scenario of ADE20K, respectively. Together with the two variants, we report the results of our method (MiB), the offline training upper bound (Joint), and the LwF-MC version employed in Section 3.4.3, which uses the background in both the binary cross-entropy and the distillation, blending the two components with a hyper-parameter.

As the tables show, the three variants of LwF-MC exhibit different trade-offs between learning new knowledge and remembering the old. In particular, LwF-MC-C learns new classes very well, always being the best-performing variant on the last incremental step. However, it suffers a significant drop on the old knowledge, showing its inability to tackle the catastrophic forgetting problem.

Table B.1. Comparison of different implementations of LwF-MC on the Pascal-VOC 2012 overlapped setup.

| Method | 19-1: 1-19 | 19-1: 20 | 19-1: all | 15-5: 1-15 | 15-5: 16-20 | 15-5: all | 15-1: 1-15 | 15-1: 16-20 | 15-1: all |
|---|---|---|---|---|---|---|---|---|---|
| LwF-MC-C | 44.6 | 17.6 | 43.2 | 41.6 | 42.2 | 41.8 | 4.4 | 8.6 | 5.4 |
| LwF-MC | 64.4 | 13.3 | 61.9 | 58.1 | 35.0 | 52.3 | 6.4 | 8.4 | 6.9 |
| LwF-MC-D | 71.3 | 3.6 | 68.0 | 73.7 | 21.0 | 60.5 | 41.1 | 3.1 | 31.6 |
| MiB | 70.2 | 22.1 | 67.8 | 75.5 | 49.4 | 69.0 | 35.1 | 13.5 | 29.7 |
| Joint | 77.4 | 78.0 | 77.4 | 79.1 | 72.6 | 77.4 | 79.1 | 72.6 | 77.4 |


LwF-MC-D shows the opposite trend. It maintains the old knowledge very well, being the best variant on old classes in every setting. However, it is very intransigent [36], i.e. it is not able to correctly learn new classes, thus obtaining the worst performance on them.

As expected, LwF-MC, which considers the background in both the cross-entropy and the distillation, achieves a trade-off: it learns new knowledge, as LwF-MC-C does, while preserving the old one, as LwF-MC-D does.

As the tables show, our MiB approach models the background more effectively, achieving the best trade-off between learning new knowledge and preserving old concepts. In particular, our method is the best by a margin on the new classes in all scenarios, while on the old ones it is either better than or comparable to the intransigent LwF-MC-D method. The only scenarios where it shows lower performance are the multi-step ones. Indeed, in these scenarios the multiple learning episodes make preserving old knowledge harder, and an intransigent method is less prone to forgetting since it is biased towards old classes. However, intransigence is not the right solution when the numbers of old and new classes are balanced, as in the 50-50 scenario of ADE20K, since the overall performance is damaged.


Table B.2. Comparison of different implementations of LwF-MC on the 50-50 setting of the ADE20K dataset.

Method      1-50   51-100   101-150   all
LwF-MC-C     8.0      7.2      19.3  11.5
LwF-MC      27.8      7.0      10.4  15.1
LwF-MC-D    39.1     10.9       6.7  18.7
MiB         35.5     22.2      23.6  27.0
Joint       51.1     38.3      28.2  38.9


B.1.2 Per-class results on Pascal-VOC 2012


From Table B.3 to Table B.8, we report the results for all classes of the Pascal-VOC 2012 dataset. As the tables show, MiB achieves the best results in the majority of classes (i.e. at least 14/20 in the 19-1 scenarios, 13/20 in the 15-5 ones and 16/20 in the 15-1 ones), being either the second best or comparable to the top two in all the others. Remarkable cases are the ones where we learn classes that are either similar in appearance (e.g. bus and train) or appear in similar contexts (e.g. sheep and cow): for those pairs, our model outperforms the competitors by a margin on both old classes (i.e. bus and cow in the 15-5 and 15-1 scenarios) and new ones (i.e. sheep and train). These results show the capability of MiB not only to learn new knowledge while preserving the old one, but also to learn discriminative features for difficult cases across different learning steps.

Table B.3. Per-class Mean IoU on the 19-1 setting of Pascal-VOC 2012 (disjoint setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-19 all
FT 11.9 2.1 1.1 11.6 4.8 6.9 13.5 0.2 0.0 3.8 14.4 0.5 1.5 4.7 0.0 15.8 2.8 1.8 13.5 12.3 5.8 6.2
PI [300] 22.3 1.9 3.4 4.9 2.1 10.6 8.5 0.1 0.1 3.1 12.8 0.2 3.8 4.6 0.0 10.0 5.0 1.1 8.5 14.1 5.4 5.9
EWC [118] 50.7 7.7 21.0 24.1 21.8 35.8 43.9 11.6 2.0 27.0 21.1 23.0 18.7 19.4 1.5 27.8 41.5 5.6 37.4 16.0 23.2 22.9
RW [36] 45.8 5.3 15.1 22.8 17.8 28.9 40.9 7.5 1.3 22.4 20.3 14.5 13.7 16.3 0.8 25.3 31.8 4.8 33.3 15.7 19.4 19.2
LwF [144] 28.1 40.5 53.1 38.8 47.4 46.4 63.6 83.5 35.8 60.1 48.8 76.5 65.3 67.1 83.2 50.2 61.2 42.5 14.2 9.1 53.0 50.8
LwF-MC [216] 79.4 41.3 75.6 47.9 51.0 69.6 75.4 78.5 35.1 66.6 49.0 72.7 73.8 71.6 84.9 57.5 67.7 42.7 56.8 13.2 63.0 60.5
ILT [178] 83.7 40.8 80.8 59.1 58.4 77.6 82.4 82.3 38.9 81.7 50.8 84.8 86.6 81.0 83.3 56.4 82.2 43.8 57.5 16.4 69.1 66.4
MiB 78.0 40.5 85.7 51.6 64.4 79.1 77.8 89.9 39.2 82.3 55.4 86.2 82.7 72.2 83.6 56.6 86.2 45.1 65.0 25.6 69.6 67.4
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 77.4 77.4

Table B.4. Per-class Mean IoU on the 19-1 setting of Pascal-VOC 2012 (overlapped setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-19 all
FT 23.7 1.9 1.5 9.3 6.9 16.9 8.5 0.0 0.0 9.5 5.3 0.1 2.9 8.8 0.0 15.1 1.0 0.7 16.0 12.9 6.8 7.1
PI [300] 33.1 4.1 3.6 10.5 8.4 14.7 13.3 0.0 0.1 2.4 4.7 0.1 3.3 7.9 0.0 14.7 0.8 2.7 17.8 14.0 7.5 7.8
EWC [118] 60.7 14.8 21.2 33.8 36.9 54.4 45.6 2.6 1.4 33.0 13.3 19.1 23.8 39.2 2.2 34.6 21.8 6.4 47.1 14.0 26.9 26.3
RW [36] 57.5 12.1 15.4 29.6 32.9 50.7 40.0 1.3 0.8 30.7 10.7 12.6 18.6 32.9 0.8 30.7 17.5 5.5 42.7 14.2 23.3 22.9
LwF [144] 36.6 35.1 62.0 32.9 47.5 31.6 51.5 77.9 36.5 67.7 44.3 71.4 68.6 66.2 82.2 49.6 58.7 41.1 11.9 8.5 51.2 49.1
LwF-MC [216] 67.2 37.9 77.8 40.6 57.0 54.5 77.4 88.4 37.2 76.8 49.1 83.4 82.3 71.0 85.2 55.6 81.9 46.0 54.9 13.3 64.4 61.9
ILT [178] 87.2 39.0 80.6 53.5 57.0 80.3 76.0 74.3 37.6 81.1 44.6 83.1 84.4 81.6 82.4 54.5 82.7 38.9 56.1 12.3 67.1 64.4
MiB 78.1 36.2 86.8 49.4 72.7 80.8 78.2 90.8 38.3 82.0 51.9 86.7 82.8 76.9 83.8 58.8 84.4 45.7 68.5 22.1 70.2 67.8
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 77.4 77.4

Table B.5. Per-class Mean IoU on the 15-5 setting of Pascal-VOC 2012 (disjoint setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-15 16-20 all
FT 6.1 0.0 0.2 8.3 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 1.8 0.0 0.0 24.6 24.3 36.2 32.5 50.2 1.1 33.6 9.2
PI [300] 8.8 0.0 0.2 10.5 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 25.6 24.7 34.3 34.1 52.0 1.3 34.1 9.5
EWC [118] 58.8 4.1 56.4 46.2 44.4 4.3 67.4 3.6 2.3 14.8 10.3 12.4 51.6 20.4 2.9 28.8 32.2 35.6 35.5 56.3 26.7 37.7 29.4
RW [36] 51.1 1.5 36.9 42.9 27.5 2.1 47.4 1.1 1.2 6.1 5.3 3.1 31.2 10.5 1.0 27.7 29.8 35.7 34.7 56.6 17.9 36.9 22.7
LwF [144] 63.1 40.1 72.4 52.1 67.0 6.7 80.3 84.2 31.1 5.7 51.3 82.0 75.0 79.4 85.6 35.3 27.1 37.0 37.0 50.5 58.4 37.4 53.1
LwF-MC [216] 78.1 42.3 78.9 62.1 78.6 47.3 84.6 89.1 35.0 26.2 50.5 86.6 77.6 84.9 86.0 35.0 35.2 40.8 49.2 45.9 67.2 41.2 60.7
ILT [178] 79.4 42.0 80.5 63.9 80.4 12.8 86.0 90.2 30.7 6.7 53.3 83.2 73.0 80.7 85.0 36.9 29.9 36.8 38.3 55.7 63.2 39.5 57.3
MiB 84.4 39.4 87.5 65.2 77.8 61.0 86.0 90.9 35.3 60.3 53.0 88.2 80.4 82.4 85.3 28.7 46.0 34.7 54.4 52.7 71.8 43.3 64.7
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 79.1 72.6 77.4

Table B.6. Per-class Mean IoU on the 15-5 setting of Pascal-VOC 2012 (overlapped setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-15 16-20 all
FT 13.4 0.1 0.0 15.6 0.8 0.0 0.3 0.0 0.0 0.0 0.0 0.0 0.9 0.0 0.0 30.9 21.6 32.8 34.9 45.1 2.1 33.1 9.8
PI [300] 7.8 0.0 0.0 12.9 0.3 0.0 0.3 0.0 0.0 0.0 0.0 0.0 2.7 0.0 0.0 33.2 22.2 33.2 36.1 42.0 1.6 33.3 9.5
EWC [118] 67.3 12.8 50.5 52.9 35.0 24.7 41.7 1.2 1.0 9.8 5.7 3.7 42.9 15.4 0.6 31.8 26.3 32.1 42.0 45.0 24.3 35.5 27.1
RW [36] 61.2 6.7 33.8 48.1 24.4 9.3 22.3 0.3 0.5 3.5 0.2 1.1 31.8 6.4 0.1 32.1 25.8 31.9 38.7 45.9 16.6 34.9 21.2
LwF [144] 64.5 40.2 72.8 56.9 57.3 9.5 82.6 88.6 33.2 8.9 48.4 81.9 75.0 78.2 84.9 34.7 27.8 33.1 39.6 48.0 58.9 36.6 53.3
LwF-MC [216] 60.6 38.9 74.7 41.6 67.2 10.8 81.4 88.8 38.7 4.3 47.4 82.2 69.9 78.9 85.8 28.4 28.5 34.1 36.4 47.8 58.1 35.0 52.3
ILT [178] 77.4 40.3 78.9 61.9 78.7 53.5 86.1 88.7 33.8 15.9 51.1 83.2 80.2 79.8 85.0 39.5 30.9 31.0 49.3 52.6 66.3 40.6 59.9
MiB 86.6 39.3 88.9 66.1 80.8 86.6 90.1 92.5 38.0 64.6 56.4 89.6 80.5 86.5 85.7 30.2 52.9 31.3 73.2 59.5 75.5 49.4 69.0
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 79.1 72.6 77.4

Table B.7. Per-class Mean IoU on the 15-1 setting of Pascal-VOC 2012 (disjoint setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-15 16-20 all
FT 0.3 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.8 0.2 1.8 0.6
PI [300] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3 8.6 0.0 1.8 0.4
EWC [118] 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 3.6 0.0 0.0 0.0 0.0 0.0 0.0 7.3 7.0 7.4 0.3 4.3 1.3
RW [36] 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 2.2 0.0 0.0 0.0 0.0 0.0 0.0 8.1 10.5 8.2 0.2 5.4 1.5
LwF [144] 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.7 0.0 0.0 1.9 8.2 7.9 0.8 3.6 1.5
LwF-MC [216] 0.0 6.3 0.8 0.0 1.1 0.0 0.1 0.3 0.0 0.0 0.0 0.0 0.2 0.0 59.0 0.0 9.5 2.9 11.9 11.0 4.5 7.0 5.2
ILT [178] 3.7 0.0 2.9 0.0 12.8 0.0 0.0 0.1 0.0 0.0 21.2 0.1 0.4 0.6 13.6 0.0 0.0 11.6 8.3 8.5 3.7 5.7 4.2
MiB 53.6 38.9 53.6 17.7 62.7 36.5 71.2 60.1 1.1 35.2 8.1 57.6 55.0 62.1 79.4 10.2 14.2 11.9 18.2 10.1 46.2 12.9 37.9
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 79.1 72.6 77.4

Table B.8. Per-class Mean IoU on the 15-1 setting of Pascal-VOC 2012 (overlapped setup).

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv 1-15 16-20 all
FT 2.6 0.0 0.0 0.7 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9.2 0.2 1.8 0.6
PI [300] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 9.1 0.0 1.8 0.5
EWC [118] 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 3.6 0.0 0.0 0.0 0.0 0.0 0.0 7.3 7.0 7.4 0.3 4.3 1.3
RW [36] 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.7 11.2 6.3 0.0 5.2 1.3
LwF [144] 3.7 0.1 0.0 2.5 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 9.0 0.0 0.0 1.6 8.9 8.8 1.0 3.9 1.8
LwF-MC [216] 0.0 7.2 5.2 0.0 25.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.2 1.3 56.2 0.0 4.9 0.2 8.6 28.2 6.4 8.4 6.9
ILT [178] 20.0 0.0 3.2 6.3 2.3 0.0 0.0 0.0 0.3 5.1 19.0 0.0 9.1 0.0 8.7 0.0 0.0 21.0 9.9 8.1 4.9 7.8 5.7
MiB 31.3 25.4 26.7 26.9 46.1 31.0 63.6 52.8 0.1 11.0 9.4 52.4 41.2 28.1 80.7 17.6 13.1 15.3 15.3 6.2 35.1 13.5 29.7
Joint 90.2 42.2 89.5 69.1 82.3 92.5 90.0 94.2 39.2 87.6 56.4 91.2 86.8 88.0 86.8 62.3 88.4 49.5 85.0 78.0 79.1 72.6 77.4


B.1.3 Validation protocol and hyper-parameters


In this work, we follow the protocol of [49] for setting the hyper-parameters in continual learning. The protocol works in three steps and does not require any data of old tasks. First, we split the training set of the current learning step into train and validation sets. We use 80% of the data for training and 20% for validation. Note that the validation set contains only labels for the current learning step.

Second, we set the general hyper-parameter values (e.g. the learning rate) as the ones achieving the highest accuracy on the new set of classes with the fine-tuned model. Since we tested multiple methods, we wanted to ensure fairness in terms of the hyper-parameters used, without producing biased results. To this end, this step is performed only once, starting from the fine-tuned model and fixing the hyper-parameters for all the methods. In particular, we set the learning rate to 10^-3 for the incremental steps in all datasets and settings.

As a final step, we set the hyper-parameters specific to the continual learning method to the highest values (to ensure minimum forgetting) subject to a tolerated decay of the performance on the new classes with respect to that achieved by the fine-tuned model (to ensure maximum learning). We set the tolerated decay to 20% of the original performance, exploring hyper-parameter values of the form A·10^B, with A ∈ {1, 5} and B ∈ {-3, …, 3}. We perform this validation procedure in the first learning step of each scenario, keeping the hyper-parameters fixed for the subsequent ones. Since this procedure is costly, we perform it only on the Pascal-VOC dataset, reusing the resulting hyper-parameters for the large-scale ADE20k. As a result, for the prior-focused methods, we obtain a weight of 500 for EWC [118] and PI [300] and of 100 for RW [36] in all scenarios. For the data-focused methods, we obtain a weight of 100 for the distillation loss of LwF [144], 10 for the one in LwF-MC [216], and 100 for both distillation losses in ILT [178], in all settings. For our MiB method, we obtain a distillation loss weight of 10 for all scenarios except the 15-1 one on Pascal VOC, where the weight is set to 100.
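The grid search over the method-specific weight can be sketched as follows. This is a hedged illustration of the protocol above, not the actual implementation: `train_and_eval` is a hypothetical helper standing in for training on the 80% split with a given loss weight and returning the new-class accuracy on the held-out 20% split.

```python
import itertools

def candidate_weights():
    """Enumerate the grid A * 10**B, with A in {1, 5} and B in {-3, ..., 3}."""
    return sorted(a * 10.0 ** b
                  for a, b in itertools.product((1, 5), range(-3, 4)))

def select_weight(train_and_eval, finetune_score, tolerated_decay=0.2):
    """Return the largest weight (least forgetting) whose new-class score
    stays within `tolerated_decay` of the fine-tuned model's score
    (maximum learning), as in the protocol described above."""
    floor = (1.0 - tolerated_decay) * finetune_score
    best = None
    for w in candidate_weights():  # ascending, so the last valid weight wins
        if train_and_eval(w) >= floor:
            best = w
    return best
```

With a score that decreases monotonically in the weight, this picks the largest weight still inside the 20% tolerance band.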

Appendix C Towards Recognizing Unseen Categories in Unseen Domains


C.1 Recognizing Unseen Categories in Unseen Domains

C.1.1 Hyperparameter choices


In this section, we will provide additional details on the hyperparameter choices and validation protocols, not included in Section 4.3.

ZSL. For each dataset, we use the train, validation and test splits provided by [278]. In all settings, we employ features extracted from the second-last layer of a ResNet-101 [98] pretrained on ImageNet as image representation, without end-to-end training. For CuMix, we consider f to be the identity function and g a simple fully-connected layer, performing the mixing directly at the feature level while applying our alignment loss in the embedding space (i.e., L_M-IMG and L_M-F coincide in this case and are applied only once). All hyperparameters have been set dataset-wise following [278], using the available validation sets. For all experiments, we use SGD as optimizer with an initial learning rate of 0.1, momentum of 0.9, and a weight decay of 0.001 for all settings but AWA, where it is set to 0. The learning rate is downscaled by a factor of ten after 2/3 of the total number of epochs, and N = 30. In particular, for CUB and FLO we train our model for 90 epochs, setting β_max = 0.8 and η_I = η_F = 10.0 for CUB, and β_max = 0.4 and η_I = η_F = 4.0 for FLO. For AWA, we train our network for 30 epochs, with β_max = 0.2 and η_I = η_F = 1.0. For SUN, we train our network for 60 epochs, with β_max = 0.8 and η_I = η_F = 10.0. In all settings, the batch size is set to 128.

DG. We use as base architecture a ResNet-18 [98] pretrained on ImageNet. For our model, we consider f to be the ResNet-18, g to be the identity function, and ω a learned fully-connected classifier. We use the same training hyperparameters and protocol of [135], setting β_max = 0.6, η_I = 0.1, η_F = 3, and N = 10.

ZSL+DG. For all the baselines and our method, we employ as base architecture a ResNet-50 [98] pretrained on ImageNet, using SGD with momentum as optimizer, with a learning rate of 0.001 for the ZSL classifier and 0.0001 for the ResNet-50 backbone, a weight decay of 5·10^-5, and momentum of 0.9. We train the models for 8 epochs (each epoch counted on the smallest source dataset), with a batch size containing 24 samples per domain. We decrease the learning rates by a factor of 10 after 6 epochs. For our model, we consider the backbone as f and a simple fully-connected layer as g. We set N = 2 and η_I = 10^-3 for all the experiments, while β_max ∈ {1, 2} and η_F ∈ {0.5, 1, 2} depending on the scenario.
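The mixing operation these hyperparameters control can be sketched as below. This is an illustrative simplification, not the exact CuMix implementation: the mixing coefficient is drawn from a Beta distribution, and the linear ramp of its ceiling toward β_max over the first N warm-up epochs is an assumption made here for exposition (the α = 2.0 shape parameter is likewise hypothetical).

```python
import random

def mix_ratio(epoch, n_warmup, beta_max, alpha=2.0):
    """Sample a mixing coefficient from Beta(alpha, alpha), with its ceiling
    ramped linearly from 0 up to beta_max over the first n_warmup epochs."""
    ceiling = beta_max * min(1.0, epoch / float(n_warmup))
    return ceiling * random.betavariate(alpha, alpha)

def mixup(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two samples (images or features) and their
    one-hot labels, as in standard mixup."""
    x = [lam * b + (1.0 - lam) * a for a, b in zip(x_a, x_b)]
    y = [lam * b + (1.0 - lam) * a for a, b in zip(y_a, y_b)]
    return x, y
```

The same `mixup` operation can be applied both at the image level and at the feature level, which is what the two weights η_I and η_F refer to.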

Table C.1. ZSL+DG scenario on the DomainNet dataset with ResNet-50 as backbone.

DG            ZSL           | clipart  infograph  painting  quickdraw  sketch  avg.
-             DEVISE [73]   | 20.1     11.7       17.6      6.1        16.7    14.4
-             ALE [1]       | 22.7     12.7       20.2      6.8        18.5    16.2
-             SPNet [277]   | 26.0     16.9       23.8      8.2        21.8    19.4
DANN [78]     DEVISE [73]   | 20.5     10.4       16.4      7.1        15.1    13.9
DANN [78]     ALE [1]       | 21.2     12.5       19.7      7.4        17.9    15.7
DANN [78]     SPNet [277]   | 25.9     15.8       24.1      8.4        21.3    19.1
EpiFCR [135]  DEVISE [73]   | 21.6     13.9       19.3      7.3        17.2    15.9
EpiFCR [135]  ALE [1]       | 23.2     14.1       21.4      7.8        20.9    17.5
EpiFCR [135]  SPNet [277]   | 26.4     16.7       24.6      9.2        23.2    20.0
CuMix                       | 27.6     17.8       25.5      9.9        22.6    20.7


C.1.2 ZSL+DG: analysis of additional baselines


In Table 4.3, we showed the performance of our method in the new ZSL+DG scenario on the DomainNet dataset [206], comparing it with three baselines: SPNet [277], simple mixup [301] coupled with SPNet, and SPNet coupled with EpiFCR [135], an episodic method for DG. We reported the results of these baselines to show 1) the performance of a state-of-the-art ZSL method (SPNet), 2) the impact of mixup alone (mixup+SPNet), and 3) the results obtained by coupling state-of-the-art models for DG and ZSL together (EpiFCR+SPNet). We chose SPNet and EpiFCR as state-of-the-art references for ZSL and DG respectively because they are very recent approaches achieving high performance in their respective scenarios.

In this section, we motivate our choices by showing that other ZSL and DG baselines achieve lower performance in this new scenario. In particular, we show the performance of two standard ZSL methods, ALE [1] and DEVISE [73], and of a standard DG/DA method, DANN [78]. We choose DANN since it is a strong baseline for DG on residual architectures, as shown in [135]. As in Section 4.3, we show the performance of the ZSL methods alone, of the ZSL methods coupled with DANN, and with EpiFCR. For all methods, we keep the same training hyperparameters, tuning only the method-specific ones. The results are reported in Table C.1. As the table shows, CuMix achieves superior performance even compared to these additional baselines. Moreover, these baselines achieve lower results than the EpiFCR method coupled with SPNet, as expected. It is also worth highlighting that coupling ZSL methods with DANN for DG achieves lower performance than the ZSL methods alone in this scenario. This is in line with the results reported in [206], where standard domain-alignment-based methods are shown to be ineffective on the DomainNet dataset, leading also to negative transfer in some cases [206].

Finally, we want to highlight that coupling EpiFCR with any of the ZSL baselines is not straightforward: it requires adapting the method and restructuring its losses. In particular, we substitute the classifier originally designed for EpiFCR with the classifier specific to the ZSL method we apply on top of the backbone. Moreover, we additionally replace the classification loss with the loss devised for the particular ZSL method. For instance, for EpiFCR+SPNet, we use as classifier the semantic projection network, using the cross-entropy loss in [277] as classification loss. Similarly, for EpiFCR+DEVISE and EpiFCR+ALE, we use as classifier a bi-linear compatibility function [278] coupled with a pairwise ranking objective [73] and with a weighted pairwise ranking objective [1], respectively.

Table C.2. Results on DomainNet dataset with Real-Painting as sources and ResNet-50 as backbone.

Method / Target   Clipart    Infograph   Sketch     Quickdraw   Avg.
SPNet             21.5±0.6   14.1±0.2    17.3±0.3   4.8±0.4     14.4
Epi-FCR+SPNet     22.5±0.5   14.9±0.7    18.7±0.6   5.6±0.4     15.4
MixUp img only    21.2±0.4   14.0±0.7    17.3±0.3   4.8±0.1     14.3
MixUp two-level   22.7±0.3   16.5±0.4    19.1±0.4   4.9±0.3     15.8
CuMix reverse     22.9±0.3   15.8±0.2    18.2±0.3   4.8±0.5     15.4
CuMix             23.7±0.3   17.1±0.2    19.7±0.3   5.5±0.3     16.5


C.1.3 ZSL+DG: ablation study


In order to further investigate our design choices in the ZSL+DG setting, we conducted experiments in a challenging scenario where we consider just two domains as sources, i.e. Real and Painting. The results are shown in Table C.2. On average, our model improves over SPNet by 2% and over SPNet + Epi-FCR by 1.1%. Our approach without curriculum largely outperforms standard image-level mixup [301] (by more than 2%). Applying mixup at both the feature and image level but without curriculum is effective, but still achieves lower results than our full CuMix strategy (see Table C.2). Interestingly, applying the curriculum strategy but switching the order of semantic and domain mixing (CuMix reverse) achieves lower performance than CuMix, which considers domain mixing harder than semantic mixing. This shows that, in this setting, it is important to correctly tackle intra-domain semantic mixing before including the inter-domain one.
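The ordering studied in this ablation can be sketched as a simple partner-selection rule. This is a hypothetical simplification of the curriculum, not the actual implementation: during a warm-up phase, mixing partners are drawn only from the anchor's own domain (intra-domain semantic mixing), and only afterwards may they come from any source domain (the harder, inter-domain mixing).

```python
import random

def pick_partner(anchor_domain, pool, epoch, n_warmup, rng=random):
    """`pool` is a list of (sample_id, domain) pairs. During warm-up, only
    intra-domain partners are allowed; afterwards, partners may come from
    any source domain."""
    if epoch < n_warmup:
        candidates = [s for s, d in pool if d == anchor_domain]
    else:
        candidates = [s for s, _ in pool]
    return rng.choice(candidates)
```

Reversing this order (allowing cross-domain partners from the start) corresponds to the "CuMix reverse" row of Table C.2, which performs worse.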

C.1.4 ZSL results


In this section, we report the ZSL results of Figure 4.3 in tabular form; the results are shown in Table C.3. Here, we also report the results of a baseline which uses just the cross-entropy loss term (similarly to [277]), without the mixing term employed in our CuMix method. As the table shows, our baseline is weak, performing below most of the ZSL methods in all scenarios but FLO. However, adding our mixing strategy boosts the performance in all scenarios, achieving state-of-the-art results in most of them. We also want to highlight that in Table C.3, as in Figure 4.3, we do not report the results of methods based on generating features of unseen classes for ZSL [279, 281]. This choice is linked to the fact that these methods can be used as data augmentation strategies to improve the performance of any ZSL method, as shown in [279]. While using them can improve the results of all the baselines as well as of CuMix, this falls outside the scope of our work.

Table C.3. ZSL results.

Method         CUB    SUN    AWA1   FLO
ALE [1]        54.9   58.1   59.9   48.5
SJE [2]        53.9   53.7   65.6   53.4
SYNC [34]      56.3   55.6   54.0   -
GFZSL [265]    49.3   60.6   68.3   -
SPNet [277]    56.5   60.7   66.2   -
Baseline       52.4   58.2   62.5   58.4
CuMix          60.4   62.4   64.0   59.7

Bibliography


[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819-826, 2013.

[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927-2936, 2015.

[3] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139-154, 2018.

[4] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366-3375, 2017.

[5] R. Aljundi, K. Kelchtermans, and T. Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254-11263, 2019.

[6] R. Aljundi and T. Tuytelaars. Lightweight unsupervised domain adaptation by convolutional filter reconstruction. In European Conference on Computer Vision, pages 508-515. Springer, 2016.

[7] G. Angeletti, B. Caputo, and T. Tommasi. Adaptive deep learning through visual domain localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7135-7142. IEEE, 2018.

[8] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425-2433, 2015.

[9] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898-916, 2010.

[10] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pages 998-1008, 2018.

[11] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, volume 1, page 3, 2016.

[12] A. G. Banerjee, A. Barnes, K. N. Kaipa, J. Liu, S. Shriyam, N. Shah, and S. K. Gupta. An ontology to enable optimized task partitioning in human-robot collaboration for warehouse kitting operations. In Next-Generation Robotics II; and Machine Intelligence and Bio-inspired Computation: Theory and Applications IX, volume 9494, page 94940H. International Society for Optics and Photonics, 2015.

[13] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What's the point: Semantic segmentation with point supervision. In European conference on computer vision, pages 549-565. Springer, 2016.

[14] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151-175, 2010.

[15] A. Bendale and T. Boult. Towards open world recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1893-1902, 2015.

[16] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[17] R. Berriel, S. Lathuillere, M. Nabi, T. Klein, T. Oliveira-Santos, N. Sebe, and E. Ricci. Budget-aware adapters for multi-domain learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 382-391, 2019.

[18] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3034-3042, 2016.

[19] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.

[20] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger. Understanding batch normalization. In Advances in Neural Information Processing Systems, pages 7694-7705, 2018.

[21] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.

[22] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722-3731, 2017.

[23] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722-3731, 2017.

[24] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in neural information processing systems, pages 343-351, 2016.

[25] R. Camoriano, G. Pasquale, C. Ciliberto, L. Natale, L. Rosasco, and G. Metta. Incremental robot learning of new objects with fixed update time. In 2017 International Conference on Robotics and Automation (ICRA), pages 3207-3214, 2017.

[26] R. Camoriano, S. Traversaro, L. Rosasco, G. Metta, and F. Nori. Incremental semiparametric inverse dynamics learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 544-550. IEEE, 2016.

[27] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2229-2238, 2019.

[28] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo. Autodial: Automatic domain alignment layers. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5077-5085. IEEE, 2017.

[29] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo. Just dial: Domain alignment layers for unsupervised domain adaptation. In International Conference on Image Analysis and Processing, pages 357-369. Springer, 2017.

[30] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233-248, 2018.

[31] F. Cermelli, M. Mancini, S. R. Bulò, E. Ricci, and B. Caputo. Modeling the background for incremental learning in semantic segmentation. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[32] F. Cermelli, M. Mancini, E. Ricci, and B. Caputo. The rgb-d triathlon: Towards agile visual toolboxes for robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.

[33] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pages 5933-5942, 2019.

[34] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5327-5336, 2016.

[35] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of the British Machine Vision Conference, 2014.

[36] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532-547, 2018.

[37] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834-848, 2017.

[38] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[39] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3640-3649, 2016.

[40] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801-818, 2018.

[41] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1431-1439, 2015.

[42] Z. Chen, A. Jacobson, N. Sunderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford. Deep learning features at scale for visual place recognition. arXiv preprint arXiv:1701.05105, 2017.

[43] Z. Chen, J. Zhuang, X. Liang, and L. Lin. Blending-target domain adaptation by adversarial meta-adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2248-2257, 2019.

[44] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606-3613, 2014.

[45] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160-167, 2008.

[46] G. Costante, T. A. Ciarfuglia, P. Valigi, and E. Ricci. A transfer learning approach for multi-cue semantic place recognition. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2122-2129. IEEE, 2013.

[47] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757-1774, 2008.

[48] G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.

[49] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.

[50] R. De Rosa, T. Mensink, and B. Caputo. Online open world recognition. arXiv preprint arXiv:1604.02275, 2016.

[51] L. Deecke, I. Murray, and H. Bilen. Mode normalization. In International Conference on Learning Representations, 2018.

[52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248-255. Ieee, 2009.

[53] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599-8603. IEEE, 2013.

[54] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, et al. Recent advances in deep learning for speech research at microsoft. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8604-8608. IEEE, 2013.

[55] L. Deng and Y. Liu. Deep learning in natural language processing. Springer, 2018.

[56] P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, and R. Chellappa. Learning without memorizing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5138-5146, 2019.

[57] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3270-3277, 2014.

[58] K. Dmitriev and A. E. Kaufman. Learning multi-class segmentations from single-class datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9501-9511, 2019.

[59] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647-655, 2014.

[60] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 289-296, 2009.

[61] A. Dutta and Z. Akata. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5089-5098, 2019.

[62] A. D'Innocente and B. Caputo. Domain generalization with domain-specific aggregation modules. In German Conference on Pattern Recognition, pages 187-198. Springer, 2018.

[63] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics, 31(4):44-1, 2012.

[64] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[65] E. Fazl-Ersi and J. K. Tsotsos. Histogram of oriented uniform patterns for robust place recognition and categorization. The International Journal of Robotics Research, 31(4):468-483, 2012.

[66] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594-611, 2006.

[67] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision, pages 2960-2967, 2013.

[68] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

[69] D. Fontanel, F. Cermelli, M. Mancini, S. R. Bulò, E. Ricci, and B. Caputo. Boosting deep open world recognition by clustering. Technical report, 2020.

[70] V. Fragoso, P. Sen, S. Rodriguez, and M. Turk. Evsac: accelerating hypotheses generation by modeling matching scores with extreme value theory. In Proceedings of the IEEE International Conference on Computer Vision, pages 2472-2479, 2013.

[71] R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128-135, 1999.

[72] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.

[73] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121-2129, 2013.

[74] N. Frosst, N. Papernot, and G. Hinton. Analyzing and improving representations with the soft nearest neighbor loss. In International Conference on Machine Learning, pages 2012-2020, 2019.

[75] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence, 37(11):2332-2345, 2015.

[76] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 87-97, 2016.

[77] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180-1189, 2015.

[78] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

[79] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pages 2551-2559, 2015.

[80] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597-613. Springer, 2016.

[81] B. Gholami, P. Sahu, O. Rudovic, K. Bousmalis, and V. Pavlovic. Unsupervised multi-target domain adaptation: An information theoretic approach. IEEE Transactions on Image Processing, 29:3993-4002, 2020.

[82] S. Ginosar, K. Rakelly, S. Sachs, B. Yin, and A. A. Efros. A century of portraits: A visual historical record of american high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1-7, 2015.

[83] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580-587, 2014.

[84] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning, pages 222-230, 2013.

[85] B. Gong, K. Grauman, and F. Sha. Reshaping visual datasets for domain adaptation. In Advances in Neural Information Processing Systems, pages 1286-1294, 2013.

[86] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066-2073. IEEE, 2012.

[87] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[88] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672-2680, 2014.

[89] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[90] R. M. Goodman and Z. Zeng. A learning algorithm for multi-layer perceptrons with hard-limiting threshold units. In Proceedings of IEEE Workshop on Neural Networks for Signal Processing, pages 219-228. IEEE, 1994.

[91] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the 2011 International Conference on Computer Vision, pages 999-1006, 2011.

[92] R. Gopalan, R. Li, and R. Chellappa. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE transactions on pattern analysis and machine intelligence, 36(11):2288-2302, 2013.

[93] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.

[94] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389-3396. IEEE, 2017.

[95] S. Guerriero, B. Caputo, and T. Mensink. Deep nearest class mean classifiers. In International Conference on Learning Representations, Worskhop Track, 2018.

[96] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris. Spottune: transfer learning through adaptive fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4805-4814, 2019.

[97] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2765-2773, 2017.

[98] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.

[99] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630-645. Springer, 2016.

[100] G. Hinton. Neural networks for machine learning, 2012. Coursera, video lectures.

[101] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82-97, 2012.

[102] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[103] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867-874, 2014.

[104] J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In European Conference on Computer Vision, pages 702-715. Springer, 2012.

[105] D. Holz, A. Topalidou-Kyniazopoulou, F. Rovida, M. R. Pedersen, V. Krüger, and S. Behnke. A skill-based system for object perception and manipulation for automating kitting tasks. In 2015 IEEE 20th Conference on Emerging Technologies & Factory Automation (ETFA), pages 1-9. IEEE, 2015.

[106] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 831-839, 2019.

[107] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700-4708, 2017.

[108] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601-608, 2007.

[109] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.

[110] N. Jacobs, N. Roman, and R. Pless. Consistent temporal variations in many outdoor scenes. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-6. IEEE, 2007.

[111] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675-678, 2014.

[112] K. N. Kaipa, S. S. Thevendria-Karthic, S. Shriyam, A. M. Kabir, J. D. Langsfeld, and S. K. Gupta. Resolving automated perception system failures in bin-picking tasks using assistance from remote human operators. In 2015 IEEE International Conference on Automation Science and Engineering (CASE), pages 1453-1458. IEEE, 2015.

[113] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128-3137, 2015.

[114] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In Thirty-second AAAI conference on artificial intelligence, 2018.

[115] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158-171, 2012.

[116] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[117] Z. Kira. Transfer of sparse coding representations and object classifiers across heterogeneous robots. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2209-2215. IEEE, 2014.

[118] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521-3526, 2017.

[119] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE international conference on computer vision, pages 2452-2460, 2015.

[120] I. Kostavelis and A. Gasteratos. Semantic mapping for mobile robotics tasks: A survey. Robotics and Autonomous Systems, 66:86-103, 2015.

[121] J. Kozerawski and M. Turk. Clear: Cumulative learning for one-shot one-class image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3446-3455, 2018.

[122] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554-561, 2013.

[123] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[124] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097-1105, 2012.

[125] I. Kuzborskij, F. Orabona, and B. Caputo. From n to n+1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358-3365, 2013.

[126] M. Lagunes-Fortiz, D. Damen, and W. Mayol-Cuevas. Learning discriminative embeddings for object recognition on-the-fly. In 2019 International Conference on Robotics and Automation (ICRA), pages 2932-2938. IEEE, 2019.

[127] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view rgb-d object dataset. In 2011 IEEE international conference on robotics and automation, pages 1817-1824. IEEE, 2011.

[128] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.

[129] C. H. Lampert. Predicting the future behavior of a time-varying probability distribution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 942-950, 2015.

[130] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence, 36(3):453-465, 2013.

[131] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[132] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Díaz-Rodríguez. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52-68, 2020.

[133] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542-5550, 2017.

[134] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[135] D. Li, J. Zhang, Y. Yang, C. Liu, Y.-Z. Song, and T. M. Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1446-1455, 2019.

[136] F. Li and H. Wechsler. Open set face recognition using transduction. IEEE transactions on pattern analysis and machine intelligence, 27(11):1686-1697, 2005.

[137] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400-5409, 2018.

[138] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool. Domain generalization and adaptation using low rank exemplar svms. IEEE transactions on pattern analysis and machine intelligence, 40(5):1114-1127, 2017.

[139] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool. Domain generalization and adaptation using low rank exemplar svms. IEEE transactions on pattern analysis and machine intelligence, 40(5):1114-1127, 2018.

[140] Y. Li and N. Vasconcelos. Efficient multi-domain learning by covariance normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5424-5433, 2019.

[141] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109-117, 2018.

[142] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.

[143] Y. Li, Y. Yang, W. Zhou, and T. Hospedales. Feature-critic networks for heterogeneous domain generalization. In International Conference on Machine Learning, pages 3915-3924, 2019.

[144] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935-2947, 2017.

[145] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[146] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925-1934, 2017.

[147] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740-755. Springer, 2014.

[148] Y. Lin, J. Chen, Y. Cao, Y. Zhou, L. Zhang, Y. Y. Tang, and S. Wang. Cross-domain recognition by identifying joint subspaces of source domain and target domain. IEEE Transactions on Cybernetics, 47(4):1090-1101, 2017.

[149] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa. Fast object localization and pose estimation in heavy clutter for robotic bin picking. The International Journal of Robotics Research, 31(8):951-973, 2012.

[150] S. Liu, E. Johns, and A. J. Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1871-1880, 2019.

[151] V. Lomonaco and D. Maltoni. Core50: a new dataset and benchmark for continuous object recognition. In Conference on Robot Learning, pages 17-26, 2017.

[152] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431-3440, 2015.

[153] M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. S. Yu. Transfer sparse coding for robust image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 407-414, 2013.

[154] M. Long and J. Wang. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97-105, 2015.

[155] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems, pages 136-144, 2016.

[156] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2208-2217. JMLR, 2017.

[157] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.

[158] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91-110, 2004.

[159] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[160] A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67-82, 2018.

[161] A. Mallya and S. Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765-7773, 2018.

[162] M. Mancini, Z. Akata, E. Ricci, and B. Caputo. Towards recognizing unseen categories in unseen domains. Technical report, 2020.

[163] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Best sources forward: Domain generalization through source-specific nets. In IEEE International Conference on Image Processing (ICIP), 2018.

[164] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 3(3):2093-2100, 2018.

[165] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Adagraph: Unifying predictive and continuous domain adaptation through graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[166] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo. Kitting in the wild through online domain adaptation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.

[167] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo. Knowledge is never enough: Towards web aided deep open world recognition. In IEEE International Conference on Robotics and Automation (ICRA), 2019.

[168] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Inferring latent domains for unsupervised deep domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 2019.

[169] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[170] M. Mancini, L. Porzi, F. Cermelli, and B. Caputo. Discovering latent domains for unsupervised domain adaptation through consistency. In International Conference on Image Analysis and Processing, pages 390-401. Springer, 2019.

[171] M. Mancini, E. Ricci, B. Caputo, and S. R. Bulò. Adding new tasks to a single network with weight transformations using binary masks. In The European Conference on Computer Vision (ECCV) Workshops. Springer, 2018.

[172] M. Mancini, E. Ricci, B. Caputo, and S. R. Bulò. Boosting binary masks for multi-domain learning through affine transformations. Machine Vision and Applications, 31(6):1-14, 2020.

[173] M. Mancini, S. Rota Bulò, E. Ricci, and B. Caputo. Learning deep nbnn representations for robust place categorization. IEEE Robotics and Automation Letters, 2(3):1794-1801, 2017.

[174] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In Advances in neural information processing systems, pages 1041-1048, 2009.

[175] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109-165. Elsevier, 1989.

[176] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In Proceedings of the IEEE International Conference on Computer Vision, pages 2678-2687, 2017.

[177] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In European Conference on Computer Vision, pages 488-501. Springer, 2012.

[178] U. Michieli and P. Zanuttigh. Incremental learning techniques for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0-0, 2019.

[179] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[180] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111-3119, 2013.

[181] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994-4003, 2016.

[182] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[183] P. Morerio, J. Cavazza, and V. Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In International Conference on Learning Representations, 2018.

[184] P. Morgado and N. Vasconcelos. Nettailor: Tuning the architecture, not just the weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3044-3054, 2019.

[185] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5715-5725, 2017.

[186] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10-18, 2013.

[187] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. IEEE transactions on pattern analysis and machine intelligence, 28(11):1863-1868, 2006.

[188] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

[189] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS-WS on deep learning and unsupervised feature learning, 2011.

[190] H. V. Nguyen, H. T. Ho, V. M. Patel, and R. Chellappa. Dash-n: Joint hierarchical domain adaptation and feature learning. IEEE Transactions on Image Processing, 24(12):5479-5491, 2015.

[191] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722-729. IEEE, 2008.

[192] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722-729. IEEE, 2008.

[193] L. Niu, Q. Tang, A. Veeraraghavan, and A. Sabharwal. Learning from noisy web data with category-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7689-7698, 2018.

[194] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69-84. Springer, 2016.

[195] O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11321-11329, 2019.

[196] F. Ozdemir, P. Fuernstahl, and O. Goksel. Learn the new, keep the old: Extending pretrained models with new anatomy and images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 361-369. Springer, 2018.

[197] F. Ozdemir and O. Goksel. Extending pretrained segmentation networks with additional anatomical structures. International journal of computer assisted radiology and surgery, 14(7):1187-1195, 2019.

[198] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2010.

[199] G. I. Parisi and C. Kanan. Rethinking continual learning for autonomous agents and robots. arXiv preprint arXiv:1907.01929, 2019.

[200] G. I. Parisi, J. Tani, C. Weber, and S. Wermter. Lifelong learning of spatiotemporal representations with dual-memory recurrent self-organization. Frontiers in neurorobotics, 12:78, 2018.

[201] G. Pasquale, C. Ciliberto, F. Odone, L. Rosasco, and L. Natale. Teaching icub to recognize objects using deep convolutional neural networks. In Machine Learning for Interactive Systems, pages 21-25, 2015.

[202] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024-8035, 2019.

[203] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53-69, 2015.

[204] G. Patterson and J. Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2751-2758. IEEE, 2012.

[205] K.-C. Peng, Z. Wu, and J. Ernst. Zero-shot deep domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 764-781, 2018.

[206] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406-1415, 2019.

[207] L. Porzi, S. R. Bulo, A. Colovic, and P. Kontschieder. Seamless scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8277-8286, 2019.

[208] S. Prasath Elango, T. Tommasi, and B. Caputo. Transfer learning of visual concepts across robots: A discriminative approach. Technical report, Idiap, 2012.

[209] A. Pronobis and B. Caputo. Cold: The cosy localization database. The International Journal of Robotics Research, 28(5):588-594, 2009.

[210] A. Pronobis, B. Caputo, P. Jensfelt, and H. I. Christensen. A realistic benchmark for visual indoor place recognition. Robotics and autonomous systems, 58(1):81-96, 2010.

[211] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5822-5830, 2018.

[212] F. Qiao, L. Zhao, and X. Peng. Learning to learn single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12556-12565, 2020.

[213] D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pages 7647-7657, 2019.

[214] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506-516, 2017.

[215] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8119-8127, 2018.

[216] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001-2010, 2017.

[217] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49-58, 2016.

[218] K. Rematas, B. Fernando, T. Tommasi, and T. Tuytelaars. Does evolution cause a domain shift? In ICCV Workshop on Visual Domain Adaptation and Dataset Bias, 2013.

[219] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91-99, 2015.

[220] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137-1149, 2017.

[221] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, pages 833-840, Madison, WI, USA, 2011. Omnipress.

[222] M. B. Ring. Child: A first step towards continual learning. Machine Learning, 28(1):77-104, 1997.

[223] A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, 42(3):651-663, 2018.

[224] S. Rota Bulò, L. Porzi, and P. Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5639-5647, 2018.

[225] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211-252, 2015.

[226] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: symmetric bi-directional adaptive gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8099-8108, 2018.

[227] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[228] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213-226. Springer, 2010.

[229] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2988-2997. JMLR, 2017.

[230] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412-419, 2007.

[231] B. Saleh and A. Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. In International Conference on Data Mining Workshops, 2015.

[232] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8503-8512, 2018.

[233] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483-2493, 2018.

[234] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757-1772, 2012.

[235] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8247-8255, 2019.

[236] A. Schyja, A. Hypki, and B. Kuhlenkötter. A modular and extensible framework for real and virtual bin-picking environments. In 2012 IEEE International Conference on Robotics and Automation, pages 5246-5251. IEEE, 2012.

[237] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 2110-2118, 2016.

[238] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi. Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745, 2018.

[239] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990-2999, 2017.

[240] K. Shmelkov, C. Schmid, and K. Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3400-3409, 2017.

[241] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2107-2116, 2017.

[242] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2377-2386, 2019.

[243] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, page 05, 2013.

[244] M. Simon, E. Rodner, and J. Denzler. Imagenet pre-trained models with batch normalization. arXiv preprint arXiv:1612.01452, 2016.

[245] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[246] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pages 4077-4087, 2017.

[247] S. Song, L. Zhang, and J. Xiao. Robot in a room: Toward perfect object recognition in closed environments. CoRR, abs/1507.02703, 2015.

[248] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[249] C. Stachniss, O. M. Mozos, and W. Burgard. Speeding-up multi-robot exploration by considering semantic place information. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), pages 1692-1697. IEEE, 2006.

[250] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323-332, 2012.

[251] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[252] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye. A two-stage weighting framework for multi-source domain adaptation. In Advances in neural information processing systems, pages 505-513, 2011.

[253] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 403-412, 2019.

[254] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, et al. The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4-5):405-420, 2018.

[255] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199-1208, 2018.

[256] O. Tasar, Y. Tarabalka, and P. Alliez. Incremental learning for semantic segmentation of large-scale remote sensing data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(9):3524-3537, 2019.

[257] W. Thong, P. Mettes, and C. G. Snoek. Open cross-domain visual search. arXiv preprint arXiv:1911.08621, 2019.

[258] S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25-46, 1995.

[259] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.

[260] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068-4076, 2015.

[261] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167-7176, 2017.

[262] P. Uršič, A. Leonardis, M. Kristan, et al. Part-based room categorization for household service robots. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 2287-2294. IEEE, 2016.

[263] S. Valipour, C. Perez, and M. Jagersand. Incremental learning for robot perception through hri. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2772-2777. IEEE, 2017.

[264] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438-6447. PMLR, 2019.

[265] V. K. Verma and P. Rai. A simple exponential family framework for zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 792-808. Springer, 2017.

[266] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630-3638, 2016.

[267] R. Volpi and V. Murino. Addressing model vulnerability to distributional shifts over image transformation sets. In Proceedings of the IEEE International Conference on Computer Vision, pages 7980-7989, 2019.

[268] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pages 5334-5344, 2018.

[269] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[270] M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 2018.

[271] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-ucsd birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

[272] C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. Memory replay gans: Learning to generate new categories without forgetting. In Advances In Neural Information Processing Systems, pages 5962-5972, 2018.

[273] J. Wu, H. I. Christensen, and J. M. Rehg. Visual place categorization: Problem, dataset, and algorithm. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4763-4770. IEEE, 2009.

[274] J. Wu and J. M. Rehg. Centrist: A visual descriptor for scene categorization. IEEE transactions on pattern analysis and machine intelligence, 33(8):1489-1501, 2010.

[275] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu. Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 374-382, 2019.

[276] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 69-77, 2016.

[277] Y. Xian, S. Choudhury, Y. He, B. Schiele, and Z. Akata. Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8256-8265, 2019.

[278] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence, 41(9):2251-2265, 2018.

[279] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542-5551, 2018.

[280] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582-4591, 2017.

[281] Y. Xian, S. Sharma, B. Schiele, and Z. Akata. f-vaegan-d2: A feature generating framework for any-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10275-10284, 2019.

[282] J. Xie, W. Hu, S.-C. Zhu, and Y. N. Wu. Learning sparse frame models for natural image patterns. International Journal of Computer Vision, 114(2-3):91-112, 2015.

[283] C. Xiong, S. McCloskey, S.-H. Hsieh, and J. J. Corso. Latent domains modeling for visual domain adaptation. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[284] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451-466. Springer, 2016.

[285] M. Xu, J. Zhang, B. Ni, T. Li, C. Wang, Q. Tian, and W. Zhang. Adversarial domain adaptation with domain mixup. arXiv preprint arXiv:1912.01805, 2019.

[286] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3964-3973, 2018.

[287] Z. Xu, S. Huang, Y. Zhang, and D. Tao. Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 40(5):1100-1113, 2016.

[288] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In European Conference on Computer Vision, pages 628-643. Springer, 2014.

[289] M. Yamada, L. Sigal, and M. Raptis. No bias left behind: Covariate shift adaptation for discriminative 3d pose estimation. In European Conference on Computer Vision, pages 674-687. Springer, 2012.

[290] H. Yang and J. Wu. Object templates for visual place categorization. In Asian Conference on Computer Vision, pages 470-483. Springer, 2012.

[291] J. Yang, R. Yan, and A. G. Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pages 69-76. IEEE, 2007.

[292] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3973-3981, 2015.

[293] Y. Yang and T. M. Hospedales. Multivariate regression on the grassmannian for predicting novel domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5071-5080, 2016.

[294] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal. A zero-shot framework for sketch based image retrieval. In European Conference on Computer Vision, pages 316-333. Springer, 2018.

[295] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651-4659, 2016.

[296] T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55-75, 2018.

[297] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[298] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712-3722, 2018.

[299] X. Zeng, W. Ouyang, M. Wang, and X. Wang. Deep learning of scene-specific classifier for pedestrian detection. In European Conference on Computer Vision, pages 472-487. Springer, 2014.

[300] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3987-3995. JMLR.org, 2017.

[301] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[302] H. Zhang, S. Starke, T. Komura, and J. Saito. Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics (TOG), 37(4):1-11, 2018.

[303] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pages 4166-4174, 2015.

[304] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034-6042, 2016.

[305] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun. ExFuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 269-284, 2018.

[306] C. Zhao, T. M. Hospedales, F. Stulp, and O. Sigaud. Tensor based knowledge transfer across skill categories for robot control. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3462-3468, 2017.

[307] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881-2890, 2017.

[308] H. Zhao, S. Zhang, G. Wu, J. P. Costeira, J. M. F. Moura, and G. J. Gordon. Multiple source domain adaptation with adversarial learning. In ICLR-WS, 2018.

[309] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633-641, 2017.

[310] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang. Deep domain-adversarial image generation for domain generalisation. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

[311] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang. Learning to generate novel domains for domain generalization. In European Conference on Computer Vision, pages 561-578. Springer, 2020.

[312] J. Zhuo, S. Wang, S. Cui, and Q. Huang. Unsupervised open domain recognition by semantic discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 750-759, 2019.